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Summary 


This research program deals with the application of high-performance computing methods for the 
analysis of complete jet engines. We have initited this program by applying the two-dimensional 
parallel aeroelastic codes to the interior gas flow problem of a by-pass jet engine. The fluid mesh 
generation, domain decomposition and solution capabilities were successfully tested. We have then 
focused attention on methodology for the partitioned analysis of the interaction of the gas flow with 
a flexible structure and with the fluid mesh motion that results from these structural displacements. 
This is treated by a new ALE technique that models the fluid mesh motion as that of a fictitious mass- 
spring network. New partitioned analysis procedures to treat this coupled 3 -component problem 
are developed. These procedures involved delayed corrections and subcycling. Preliminary results 
on the stability, accuracy and MPP computational efficiency are reported. 




1. OVERVIEW 


The present program deals with the application of high-performance parallel computation for the 
analysis of complete jet engines, considering the interaction of fluid, thermal and mechanical 
components. The research is driven by the simulation of advanced aircraft propulsion systems, 
which is a problem of primary interest to NASA Lewis. 

The coupled problem involves interaction of structures with gas dynamics, heat conduction and heat 
transfer in aircraft engines. The methodology issues to be addressed include: consistent discrete 
formulation of coupled problems with emphasis on coupling phenomena; effect of partitioning 
strategies, augmentation and temporal solution procedures; sensitivity of response to problem 
parameters; and methods for interfacing multiscale discretizations. The computer implementation 
issues to be addressed include: parallel treatment of coupled systems; domain decomposition and 
mesh partitioning strategies; data representation in object-oriented form and mapping to hardware 
driven representation, and tradeoff studies between partitioning schemes with differing degree of 
coupling. 


2. STAFF 

Two graduate students have been partly supported by this grant. M. Ronaghi (U.S. citizen) began his 
graduate studies at Colorado on January 1993. He has completed a M.Sc. in Aerospace Engineering 
on May 1994 but plans to pursue his Ph. D. in Mechanical Engineering under a different research 
program that fits his background and experience better. 

U. Gu mas te (permanent U.S. resident) began his graduate studies at Colorado in the Fall semester, 
but worked in this project during June and July 1993 as an hourly research assistant. Mr. Gumaste 
has a B.Tech in Civil Engineering from the Indian Institute of Technology, Bombay, India. He 
completed his Ph. D. course requirement this semester with the transfer of graduate credit units 
from the University of Maryland. During this period he was partly supported as a Teaching Assistant. 
He plans to familiarize himself with our aeroelastic codes during May-June and will visit NASA 
Lewis for 6 weeks during July and August. 

The methods development for this project has been greatly benefited from the presence of three 
visiting Post-Docs. Dr. S. Lanteri conducted extensive experimentation on several computational 
algorithms for compressive viscous flow simulation on the iPSC-860, CM-5 and KSR-1 as reported 
in Appendix n. Dr. N. Maman has implemented “mesh matching” techniques that connect separately 
generated fluid and structural meshes. Dr. S. Pipemo has developed and evaluated implicit and 
subcycled partitioned analysis procedures for the interaction of structure, fluid and fluid-mesh 
motion. A new approach to augmentation of the governing semi-discrete equations that improves 
stability while keeping communications overhead modest was investigated. Results from this study 
are presented in Appendix I. 
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3. DEVELOPMENT OF PARTITIONED ANALYSIS METHODS 


The first parallel computations of a jet engine, presented in the first Progress Report, dealt with 
the fluid flow within a jet engine structure that was considered rigid and hence provides only 
guiding boundary conditions for the gas flow. When the structural flexibility is accounted for two 
complications occur: 

1. The engine simulation algorithm must account for the structural flexibility though periodic 
transfer of interaction information, and 

2. The fluid mesh must smoothly follow the relative structural motions through an ALE (Adaptive 
Lagrangian Eulerian) scheme. The particular ALE scheme selected for the present work makes 
use of Batina’s proposed pseudo-mechanical model of springs and masses overlaid over the 
fluid mesh. 

Research work during the period July 1993 through January 1994 was dominated by the treatment 
of two subjects: partitioned analysis of fluid-structure interaction (FSI) and accounting for fluid 
mesh motions. The partitioned analysis algorithm developed for the FSI problem is always implicit 
in the structure (because of its larger time scale of significant vibratory motions) and either explicit 
or implicit for the gas flow modeled by the Navier-Stokes equations. Subcycling, in which the 
integration stepsize for the fluid may be smaller than that used in the structure, was also studied. 

General Requirements 

The fundamental practical considerations in the development of these methods are: (1) numeri- 
cal stability, (2) fidelity to physics, (3) accuracy, and (4) MPP efficiency. Numerical stability is 
fundamental in that an unstable method, no matter how efficient, is useless. There are additional 
considerations: 

1. Stability degradation with respect to that achievable for the uncoupled fields should be min- 
imized. For example, if the treatment is implicit-implicit (I-I) we would like to maintain 
unconditional stability. If the fluid is treated explicitly we would like to maintain the same 
CFL stability limit. 

2. Masking of physical instability should be avoided. This is important in that flutter or diver- 
gence phenomena should not be concealed by numerical dissipation. For this reasons all time 
integration algorithms considered in this work must exclude the use of artificial damping. 

Stability vs. Communication-Overhead Tradeoff 

The degradation of numerical stability degradation is primarily influenced by the nature of infor- 
mation exchanged every time step among the coupled subsystems during the course of partitioned 
integration. A methodology called augmentation that systematically exploits this idea was devel- 
oped by Park and Felippa in the late 1970s. The idea is to modify the governing equations of one 
subsystem with system information from connected subsystems. The idea proved highly successful 
for the sequential computers of the time. A fresh look must be taken to augmentation, however, in 
light of the communications overhead incurred in massively parallel processing. For the present 
application three possibilities were considered: 
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No augmentation. The 3 subsystems (fluid, structure and ALE mesh) exchange only minimal 
interaction state information such as pressures and surface-motion velocities, but no information 
on system characteristics such as mass or stiffness. The resulting algorithm has minimal MPP 
communication overhead but poor stability characteristics. In fact the stability of an I-I scheme 
becomes conditional and not too different from that of a less expensive I-E scheme. This degradation 
in turns significantly limits the stepsize for both fluid and structure. 

Full augmentation. This involves transmission of inverse-matrix-type data from one system to an- 
other. Such data are typified by terms such as a a structure-to-fluid coupling-matrix times the inverse 
of the structural mass. Stability degradation can be reduced or entirely eliminated; for example I-I 
unconditional stability may be maintained. But because the transmitted matrix combinations tend 
to be much less sparse than the original system matrices, the MPP communications overhead can 
become overwhelming, thus negating the benefits of improved stability characteristics. 

Partial augmentation. This new approach involves the transmission of coupling matrix information 
which does not involve inverses. It is efficiently implemented as a delayed correction to the 
integration algorithm by terms proportional to the squared stepsize. The MPP communication 
requirements are modest in comparison to the fully-augmented case, whereas stability degradation 
can be again eliminated with some additional care. 

The partial augmentation scheme was jointly developed by S. Pipemo and C. Farhat. Its derivation 
is fully reported in Appendix I, which has been submitted for publication. The general methodology 
is first applied to a staggered FSI algorithm (staggering is a special form of partition) with common 
timestep, and then extended to cover subcycling. The reduced communications overhead has been 
recently verified on simple problems with preliminary tests on a iPSC860 Hypercube. Tests of this 
new scheme for more complex geometries will be carried out this summer. 

Algorithmic Effects of Dynamic Fluid Mesh 

The first one-dimensional results on the effect of a dynamic fluid mesh on the stability and accuracy 
of the staggered integration were obtained by C. Farhat and S. Pipemo and are also discussed in 
Appendix I. A doctoral student, M. Lesoinne, supported by NSF is extending these calculations to 
the multidimensional case. 

Appendix II contains a report by Charbel Farhat and S. Lanteri that summarizes recent experiences 
with the application of the two-dimensional Navier Stokes equations combined with the pseudo- 
mechanical network approach to dynamic mesh motion. Performance results obtained on the iPSC- 
860, KSR-1 and CM-5 parallel computers are reported for both fixed and moving fluid meshes. 

4. FUTURE WORK 

The present work undertaken since the renewal of the grant on March 1994 is focused on axisym- 
metrizing the fluid-structure interaction codes, in ’which the new partially-augmented staggered 
schemes and mesh motion techniques are implemented. We still need to introduce more physical 
effects in the gas flow, namely compression, diffusion and combustion. The modeling experience 
that Mr. Gumaste will acquire during his visit to NASA Lewis should help with the prosecution of 
these items. 
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APPENDIX I 


Partitioned Procedures for the Transient 
Solution of Coupled Aeroelastic Problems 

S. PlPERNO AND C. FARHAT 
Department of Aerospace Engineering Sciences 
and Center for Aerospace Structures 
University of Colorado at Boulder 
Boulder, CO 80309-0429 

January 1994 


Abstract 

In order to predict the dynamic response of a flexible structure in a fluid flow, the 
equations of motion of the structure and the fluid must be solved simultaneously. In this 
paper, we present several partitioned procedures for time-integrating this focus coupled 
problem and discuss their merits in terms of accuracy, stability, heterogeneous com- 
puting, I/O transfers, subcycling, and parallel processing. All theoretical results are 
derived for a one-dimensional piston model problem with a compressible flow, because 
the complete three-dimensional aeroelastic problem is difficult to analyze mathemati- 
cally. However, the insight gained from the analysis of the coupled piston problem and 
the conclusions drawn from its numerical investigation are confirmed with the numeri- 
cal simulation of the two-dimensional transient aeroelastic response of a flexible panel 
in a transonic nonlinear Euler flow regime. 


1. Introduction 

In order to predict the dynamic response of a flexible structure in a fluid 
flow, the equations of motion of the structure and the fluid must be solved si- 
multaneously. One difficulty in handling numerically the fluid/ structure coupling 
stems from the fact that the structural equations are usually formulated with 
material (Lagran gian ) co-ordinates, while the fluid equations are typically writ- 
ten using spatial (Eulerian) co-ordinates. Therefore, a straightforward approach 
to the solution of the coupled fluid /structure dynamic equations requires mov- 
ing at each time-step at least the portions of the fluid grid that are close to the 
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moving structure. This can be appropriate for small displacements of the struc- 
ture but may lead to severe grid distorsions when the structure undergoes large 
motion. Several different approaches have emerged as an alternative to partial 
re-gridding in transient aeroelastic computations, among which we note the ar- 
bitrary Lagrangian-Eulerian (ALE) formulation [1-3], the co-rotational approach 
[4,5], and dynamic meshes [6] (see also [28] for a review). All of these approaches 
treat a computational aeroelastic problem as a coupled two-field problem. 

However, the moving mesh itself can be formulated as a pseudo-structural 
system with its own dynamics [7], and therefore, the coupled transient aeroelastic 
problem can be formulated as a three- rather than two-field problem: the fluid, 
the structure, and the dynamic mesh. The semi-discrete equations governing this 
three-way coupled problem can be written as follows: 

+/■»■(,) = r‘{w( x ,t)) (i) 

- d 2 x ~dx ~ 

M ~dt T + D ~dt +Kx = Kc q 

where x is the position of a moving fluid grid point, W is the fluid state vector, 
A results from the finite element /volume discretization of the fluid equations, 
F£ is the converted vector of convective fluxes [7], F d is the vector of diffusive 
fluxes, q is the structural displacement vector, f int denotes the vector of internal 
forces in the structure, f ext the vector of external forces, M is the finite element 
mass matrix of the structure, M, D , and K are fictitious mass, damping, and 
stiffness matrices associated with the fluid moving grid and constructed to avoid 
any parasitic interaction between the fluid and its grid, or the structure and the 
moving fluid grid [7], and K c is a transfer matrix that describes the action of the 
motion of the structural side of the fluid/structure interface on the fluid dynamic 
mesh. For example, M = D = 0, and K = I corresponds to a rigid mesh motion 
of the fluid grid around an oscillating airfoil, and M — D — 0 corresponds to the 
spring-based mesh motion scheme introduced in [6]. 

Each of the three components of the coupled problem described by Eqs. (1) 
has different mathematical and numerical properties, and distinct software imple- 
mentation requirements. For Euler and Navier-Stokes flows, the fluid equations 
are nonlinear. The structural equations may be linear or nonlinear. The semi- 
discrete equations governing the pseudo- structural fluid grid system are linear. 
The matrices resulting from a linearization procedure are in general symmetric 
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for the structural problem, but they are typically unsymmetric for the fluid prob- 
lem. Morevoer, the nature of the coupling in Eqs. (1) is implicit rather than 
explicit, even when the fluid mesh motion is ignored. The fluid and the structure 
interact only at their interface, via the pressure and the motion of the physical 
interface. However, the pressure variable cannot be easily isolated neither from 
the fluid equations nor from the fluid state vector W. Consequently, the numeri- 
cal solution of Eqs. (1) via a fully coupled monolithic scheme is computationally 
challenging and software-wise unmanageable. 

Alternatively, Eqs. (1) can be solved via partitioned procedures [8]. This 
approach offers several appealing features including the ability to use well estab- 
lished discretization and solution methods within each discipline, simplification 
of software development efforts, and preservation of software modularity. Tra- 
ditionally, transient aeroelastic problems have been solved via the simplest pos- 
sible partitioned procedure whose cycle can be described as follows: a) advance 
the structural system under a given pressure load, b) update the fluid mesh ac- 
cordingly, and c) advance the fluid system and compute a new pressure load 
[9-12]. Occasionally, some investigators have advocated the introduction of a few 
predictor-corrector iterations within each cycle of this three-step staggered inte- 
grator in order to improve accuracy [13], especially when the fluid equations are 
nonlinear and treated implicitly [14]. 

The objective of this paper is the investigation of a broader range of par- 
titioned procedures for the transient solution of coupled aeroelastic problems, 
with particular attention to accuracy and stability issues, subcycling schemes, 
accuracy v.s. speed trade-offs, implementation on heterogeneous computing plat- 
forms, and inter-field as well as intra-field parallel processing. The complete 
three-dimensional aeroelastic problem is difficult to analyze because it mixes lin- 
ear and nonlinear operators, symmetric and unsymmetric matrices, explicit and 
implicit coupling, and can become physically unstable. Therefore, we begin our 
investigation with the design and analysis of partitioned integrators for a sim- 
plified one-dimensional aeroelastic problem that turned out to be a good model 
problem for the more complex aeroelastic systems that we wish to gain some 
intuition about. We focus on implicit time-integration schemes for the struc- 
tural field, because the aeroelastic response of a structure is often dominated 
by low frequency dynamics. However, we consider both implicit/implicit and 
explicit /implicit fluid/ structure partitioned procedures, with and without non- 
trivial prediction schemes. We discuss the computational and implementation 
aspects of each procedure and contrast their respective merits and shortcomings. 
Finally, we validate all the conclusions drawn from the investigation of the model 
problem with the simulation of the two-dimensional transient aeroelastic response 
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of a flexible panel in a transonic nonlinear Euler flow. 

2. A one-dimensional aeroelastic model problem with an Euler flow 

2.1. The piston problem: ALE formulation and linearization 

As a model problem, we consider the one-dimensional piston depicted in Fig. 
1. The equilibrium state of this coupled system is defined by a uniform pressure 
po inside and outside the piston chamber, a uniform gas density po> a zero flow 
velocity «o = 0> and a chamber length equal to /<>. The gas is assumed to be 
perfect, and the flow isentropic. Hence, the pressure p is function of the density 
p only and obeys: 

f[P _ „2 

dp 

where c denotes the sound speed. The cross sectional area of the chamber is 
assumed to be constant and equal to one. 



Flaid Flexible piston 



Fig. 1. The one-dimensional piston problem 

For this model aeroelastic problem, the one dimensional mass and momentum 
conservation equations for the fluid are: 

% + £(/») = o 

d. , d 2 ,, W 

dt^ + dx^ pU+P ^ = ° 

The linear dynamic equilibrium of the piston is governed by: 

mq + dq + kq = p(l Q + q) - po (4) 
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where m, d, k, and q denote respectively the piston mass, damping, stiffness and 
displacement. A dot superscript designates a derivative with respect to time. 

The boundary conditions for this coupled fluid/ structure problem are given 
by: 

H°) = 0 (5) 

«(*o + q) = q 


Eqs. (5) above state that the fluid velocity is zero at the fixed wall, and equal to 
the piston speed at the other end of the chamber. 

Clearly, the fluid flow has one moving boundary. Therefore, it is convenient 
to re-write Eqs. (3) with respect to a moving frame characterized by a velocity £ 
that may be different from the fluid velocity u and from zero. Let 3 = det(dx/d £ ) 
denote the jacobian of the frame transformation x — *■ £. The ALE form of Eqs. 
(3) goes as follows: 


W pu ^ + ~ & + = ° 


( 6 ) 


The above equations can be re-written in vector form as: 

i§i {jw)+ £ {nw)) = ° (7) 

where W and F c are respectively the fluid state and fluid convected flux vectors: 


W = ( p ) 

\pu J 

F c ( p(» — 0 A 

\(pu(u - 0 + p) ) 


( 8 ) 


The convection matrix associated with the above convected flux vector is: 

= W= (-3U 2u-i) w 

We consider the response of the aeroelastic coupled system to small pertur- 
bations around the equilibrium position (po> uo — 0, po> Co). First, we note that 

the fluid state vector at equilibrium Wo = ^ ^ satisfies: 

j^ JW ° )+ ^ (nWo)) = 0 (10) 
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and the convection matrix at equilibrium is: 


MO) = (° J) (ii) 

Then, we linearize the convected flux vector around Wq: 

F C (W) = F C (W 0 ) + J o (0)6W-6£Wo (12) 

Finally, from Eqs. (7-12) it follows that the linearized fluid flow equations are: 

~(JW) + ^(M0)SW - fjw.) = 0 (13) 


2.2. Spatial discretization: finite volume formulation and upwinding 


The one-dimensional chamber region is discretized into N grid points and N 
cells (Fig. 2). Integrating Eqs. (13) between and and using a finite 
volume formulation with upwinding leads to: 


Axj 6W + (F/ + i - F/_i) = 0 

where 


A xj 


3j+ 1 ~ 3j- 1 

2 


(14) 


Here, F c , x and F? , are the convected numerical fluxes [7] associated with the 
classical +/— flux splitting: 


f; +i = Jf(0)SWj + -tii+i 2A±J?°tt2l 


and Jq(0) and J 0 (0) satisfy Jq(0) = Jq( 0) + J 0 (0) and are given by: 



( 16 ) 



_cell_ 

• • • r • — • 

j-i j+i 


Fig. 2. Discretization of the one-dimensional flow 
Substituting Eqs. (15) into Eqs. (14) gives: 

A Xj 8W - J+(0)6Wj- ! + (J 0 + ( 0) - J^mSWj + Jo(0)8W j+1 




(W 0j ±W oj 


i+1 




Wi-.+iyo,) _ 


= 0 


(17) 


2.3. Transpiration 

From Fig.2 and the second of Eqs. (5), it follows that: 

ti K+i = « (18) 

All other ALE grid velocity perturbations 6£j, j = 1, AT are arbitrary. In order 
to simplify the piston problem from a three- to a two-field coupled problem, we 
assiimp that these velocity perturbations are small compared to the unperturbed 
sound speed c<j. Consequently, Eqs. (17) become: 

Ax, 8W - J+WSWj-i + ( J 0 + ( 0) - Jo(0))8Wj + J*(0)6W j+1 =0 (j? N) 
Ax n 6W- J 0 + (0)^-i+(J 0 + (0)-^"(0))^-W w = 0 

(19) 

The quantity qWo N = f corresponds to a “transpiration” flux. The reader 

ran check that except for the presence of this transpiration flux, Eqs. (19) are 
identical to the semi-discrete linearized equations governing a one-dimensional 
fluid flow with fixed boundaries. 
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2.4 . The semi-discrete aeroelastic model problem 
We define the structure state vector as: 


( 20 ) 


-(:) 

Using Q, the structural Eqs. (4) can be re-written as: 

Q = (_i. d 1 Q + ( p(t 0 +^)- P0 ^ 

For the linearized piston problem, the forcing term ^ P (f 0 +^)_ P0 ^ ^ ^ ^ ^ 

Hence, Eq. (21) 


( 21 ) 


is a linear function of the fluid state vector SW = 
can be re-written as: 


_( 6p 

\S(pu) i 


Q = DQ + C6W 
where 

D = (A A) 


( 22 ) 


Also, Eqs. (19) can be re-arranged in matrix form as follows: 

SW = A6W + BQ (23) 


where B is the matrix induced by the transpiration flux qWa N . 

In summary, the semi-discrete coupled system associated with the one- 
dimensional aeroelastic model problem is completely defined by: 


V Q ) ~ 

(A B\ f6W\ 

\c d){q) 

(6W\ 

(6W\ 

V Q J (t=o) 

\QJo 


(24) 


In the remainder of this paper, we focus on developing, analyzing, and val- 
idating partitioned procedures for solving Eqs. (24). Because the aeroelastic 
response of a structure is often dominated by low frequency dynamics, we con- 
sider only implicit schemes for time-integrating the structural field. However, 
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we consider both explicit and implicit time-integrators for advancing the fluid 
field, as both approaches are popular in computational fluid dynamics. Elegant 
methods for analyzing the stability of partitioned integrators with and without 
subcycling can be found in [15,16]. However, both of these references deal with 
symmetric fields only. We have found that the extension of these analysis meth- 
ods to mixed symmetric/misymmetric problems such as those described by Eqs. 
(24) is diffi cult — if not impossible — which has also motivated us to investigate 
first, a simplified aeroelastic model problem. 

REMARK 1. Eqs. (24) are also valid for two-dimensional and three-dimensional 
linearized aeroelastic problems. 


3. Mathematical preliminaries 
3.1. Physical v.s. numerical instabilities 

Transient fluid (gas)/structure interaction problems have one particularity: 
they possess a wide variety of self-excited vibrations and instabilities. For exam- 
ple, at speeds of flow somewhat above the critical flutter speed [17], the structural 
system extracts energy from the flow system and a small accidental disturbance 
of the air foil can serve as a trigger to initiate an oscillation of great violence. 
Physical instabilities can also occur in the linear regime. An example of a linear 
dynamic instability is vibrations due to von K ar m a n vortices [18]. If the frequency 
of the structure loading caused by the vortices is close or equal to the natural 
firequency of the body, then a resonance effect is present and large amplitudes of 
vibrations result. Therefore, when it comes to analyzing the numerical stability 
of a proposed algorithm for time-integrating fluid/structure interaction problems, 
it is essential to consider the case where the coupled system is physically stable 
— that is, when Eqs. (1) or even Eqs. (24) have a solution that does not grow 
indefinitely in time. 

The objectives of this section are to present a mathematical framework for 
the stability analysis of the solution of the semi-discrete Eqs. (24), and to show 
that for the aeroelastic model problem introduced and discretized in Section 2, 
these equations have always a stable solution. Hence, the fluid/structure interac- 
tion model problem presented in this paper is also a good problem for analyzing 
partitioned time-integrators with particular reference to numerical stability. 

REMARK 2. Intuitively, one can expect Eqs. (24) to admit a stable solution for 
the aeroelastic model problem, because the fluid flow is confined inside a closed 
chamber and therefore has a limited amount of energy to exchange with the 
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piston, and the piston is not excited by any other external and time-dependent 
force. However, the analysis framework presented in Section 3.2 is interesting 
because it also reveals a numerical property of the coupled model problem that 
turns out to be important for the design of an unconditionally stable partitioned 
procedure for solving Eqs. (24). 

3.2. Analysis framework 

Let X, M, and Sp (Af) denote respectively a real vector, a real matrix that 
is diagonalizable in the complex space C, and its set of complex eigenvalues. We 
focus our attention on the linear system of ordinary differential equations (ODE): 

X = MX (25) 


First, we introduce two definitions. 

DEFINITION 1. We will say that M is “stable” if and only if: 

a) M is diagonalizable in C 

b) VA € Sp (M), ft (A) < 0 

la b), ft (A) designates the real part of A. 

DEFINITION 2. We will say that the real symmetric positive definite 
(RSPD) matrix Em is an “energy matrix” for M if and only if EmM is non- 
positive, that is: 

VX, X*E m MX < 0 


Here, the superscript t designates the transpose operation. 

Next, we state and prove four theorems. 

THEOREM 1. An RSPD matrix Em is an energy matrix for M if and only 
if: 

VX solution of X = MX, -^-(^-X t E M X) < 0 (26) 


Proof. From Eq. (25), it follows that ^(^EmX) = X t E M MX. Hence, 
’Jti\X t EMX) < 0 if and only if Em is an energy matrix for M. 

THEOREM 2. If M — P~ 1 £lP denotes the diagonalization of a stable matrix 
M, then an energy matrix for M is given by: 

E m = P*P + P*P (27) 
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where P is the complex conjugate matrix of P. 

Proof. Clearly, Em = P P + P*P is real symmetric. For all real vectors 
X, X*E m X = 2\PX\ 2 is positive and equal to zero only if X = 0 because P is 
non singular. Finally since M = M, we have: 

x*e m mx = x ] 'p'opjr + x t p t ?p~ l sipx 

= 2R (X'P'fiPAT) =2^|(PX) i | 2 £(tt < ) < 0 


which completes the proof of THEOREM 2. 

THEOREM 3 (reciprocal of THEOREM 2). Let M be a real matrix that is 
diagonalizable in C. If there exists an energy matrix Em for M, then M is stable. 

Proof. Let X = R + il 0, where i 2 — —1, denote a complex eigenvector 
of M associated with an eigenvalue A. If Em is an energy matrix for M, we have: 


0 > R*E m MR = R t E M '&{MX) = R'Em^^X) = (A)P - $ (A)/] 

0 > VEmMI = PEWS (MX) = PP M S( XX) = I'EmI® (A)J + 5 (A)P] 

(28) 

where 5 (A) designates the imaginary part of A. Adding the two inequalities in 
(28) and exploiting the symmetry of Em leads to 3? (A)[P*Pa/P + PEmI] < 0. 
Since Em is RSPD, it follows that (A) < 0, which completes the proof of 
THEOREM 3. 

THEOREM 4. If A and D are two real stable matrices with energy matrices 
E A and Ef), then: 


E a B + ( EdCY = 0 


M 


■a s)‘ 


is a stable matrix 


Proof. The matrix Em = ^ ^ ^ is RSPD. If X = f ^ is a real 


E a 0 
0 Ed 

vector satisfying X = MX, we have: 


j^X'EmX) = SW*E a ASW + SW'EaBQ + Q'EdCSW + Q*E d DQ 

= 8W t E A A6W + 8W\ [E A B + ( E d C )‘] Q + Q t E D DQ 
= SW'EaASW + Q'EdDQ < 0 
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which in view of THEOREM 1 implies that Em is an energy matrix for M. From 
THEOREM 3, it follows that M is stable. 

Theorems 1-3 set the stage to THEOREM 4 which has a nice physical in- 
terpretation. An uncoupled fluid system is physically stable: it does not produce 
energy. If D is non-positive, an uncoupled structural system is also stable. For 
a coupled fluid/structure system, E^B + ( EqC )* = 0 simply expresses that the 
energy extracted from one system is equal to that injected into the other one. 
Hence, THEOREM 4 merely states that a coupled system where the energy is 
exactly conserved is a physically stable system. 

3.3. Physical stability of the model problem 


Consider again the .aeroelastic model problem introduced in Section 2. As- 
suming a constant mesh size Ax, the linearized energy Sfiuid. °f the discretized 
fluid system can be written as: 


, _ M + *M 

£fUH - 2p 0 + 2 

and its perturbed state vector is SW = (5pi,£(/>o«i)> • • • 
fore, Ea can be constructed for this system as follows: 


) (29) 

,SpN, Hpo^n)) 1 • There- 


E a 



0 


0 

Ax 

Po 


0 \ 


V o 


S* 0 

Po I 

0 W 

PO 


(30) 


Using Eqs. (19), the reader can verify through tedious but elementary calculations 
that Ea is an energy matrix for A induced by the spatial flux splitting. 


For the piston, the state vector is Q = 


and the energy is S p i 9 ton = 


(k q 2 4- mq 2 )/ 2. Hence, for this structural system Ed can be written as: 



(31) 


Proving that Ed is an energy matrix for D is straightforward. Using the second 
of Eqs. (22) we have: 


EdD 


k 0 \ / 0 -1 \ / 0 k \ 

0 m ) W " 1-* ~d) 
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which shows that EqD is negative when the damping d is positive, and therefore 
proves that Ed is an energy matrix for D. 

The transpiration term in the last grid cell and the pressure force on the 
piston generate respectively the matrices: 


(° 

0 ••• 

0 

and C = 

(° 

0 ••• 

0 

n) 

v° 

0 ••• 

Ax 

0 ) 

V» 

0 ••• 

m 

0 ) 


From Eqs. (30-32) and after some algebraic manipulations, it follows that 
EaB+IEdC)* = 0, which shows that the semi-discrete aeroelastic model problem 
introduced in Section 2 admits a stable solution. Therefore, staggered algorithms 
for time-integrating Eqs. (24) can be analyzed for unconditional stability. 


4. A family of implicit /implicit partitioned procedures 

4-1. Unconditionally stable staggered time-integrators 

Here, we present a family of unconditionally stable implicit/implicit stag- 
gered algorithms for solving the model Eqs. (24) whose “design” is based on the 
following 4-step methodology. 

Step III. Predict the structural field using the value computed at t n = nAt: 

Q p - Q n 

Step 112. Advance the fluid system using the trapezoidal rule: 

* SW n+1 = 6W n + AtA6W n+ > + AtBQ p 
where 

+4 + 

2 

Step 113. Advance the structural system using an implicit time-integrator se- 
lected from the so-called generalized trapezoidal family of methods [19], 
and a midpoint value of the previously updated fluid state <5 : 

= Q n + AtDQ™+° + AtC6W n+ * 

where 

Qn+a = (1 _ a )Q" + aQ n+1 a G ]0, 1] 
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Step 114. Correct the equations giving 6W n+1 and Q n+1 to enforce uncondi- 
tional stability of the implicit/implicit staggered procedure: 


6W n+1 = 6W n+1 + [6W n+lC ] 

Qn+l = qZ+T + [Q«+i e ] 


It should be emphasized that the above steps describe the design process of a 
solution methodology and not the computer implementation of a time-integration 
algorithm. In particular, neither the fluid nor the structural fields should be solved 
until the correction terms [<$W n+lC ] and [Q n+lC ] are first specified. 

The trapezoidal rule is unconditionally stable and second-order accurate 
when applied to the solution of the uncoupled linearized fluid system (Eqs. (17)). 
For o; > 1/2, the generalized, trapezoidal, family of methods is unconditionally 
stable when applied to time-integrate the uncoupled structural problem. For 
a € ]0, 1], these methods are first-order accurate, except for a = 1/2, in which 
case the corresponding scheme is second-order accurate. The correction terms 
[<5W n+1 ] and ] should be computed to ensure the unconditional stability 

of the resulting implicit/implicit staggered solution procedure. 

THEOREM 5. For a > 1/2, [6W n+lC ] = ±A t 2 BC6W n +± and [ Q n+1 ‘ } = 
(1 — a)A t 2 DC 6W n+ i , the implicit/implicit staggered time-integrator defined by 
Steps II1-II4 above is unconditionally stable. 

Proof. For [5W n+lC ] = ±A t 2 BC6W n+ ± and[Q n+lC ] = (l-a)A i 2 DC6W n +t, 
the proposed implicit/implicit staggered solution algorithm for solving Eqs. (24) 
becomes: 


(33) 


The above partitioned procedure can also be written as: 


SW n+1 = SW n + A tA6W n+ ± + A tBQ n + -A t 2 BC6W n+ * 

2 

Q n+1 = Q n + AtDQ n+a + AtC8W n+ * + (1 - a)At 2 DC8W n+ * 

<* 6 [±, 1 ] 


Q* 


Q** 

6W n+ 1 - SW n 

At 

Qn+1 _ 

At 


Q n + ^-C6W n+ i 
Q n + A tC6W n+ ± 
A6W n+ * + BQ* 

D((i - at)Q" + acQ"* 1 ) 


( 34 ) 
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Using the energy matrices Ea and Ed, we define the system energy as: 


Sw,q = \SW'E A SW + \q'E d Q 


( 35 ) 


From Eqs. (33-35) and DEFINITION 2 and after some algebraic manipulations, 
it follows that: 


£w+t,Q» = €sw* § q* + At6W n +¥ E A A6W n +> + AtQ*' B*E A SW n +t 
£sW n + l ,Q n — £sw n t Q n + A tQ* B t 

Ssw*+',Q** = ^ W n+t iQ n + + y^ n+ 'C^D^ n+ ' 

£sw n + l ,Q** = ^5W n + 1 ,Q n + C*£/£>Q* 

^ B +i,Q n +i = ^6lV n + l ,Q** “I" At((l “ °0Q + OtQ 71 **) 1 EdD((1 — 0:)Q + ) 

A *2 . 

+ (1 - 2a)=H(l - cx)Q* m + aQ n+1 ) D‘E d D((1 - a)Q~ + aQ n+1 ) 
=> For a > — , £&w n + l ,Q n+l — £sw n + l ,Q** 

(36) 

which also implies that: 

£«*"•+! ,Q-+t < £W»,q» + AtQ*‘ 4- At5W rn+ *' C'EdQ* 

=> £{W»+i Q n + 1 £: £tfW n ,Q" + At^W^ n+ 2 [Ea-® + (Ed^O^Q* 

(37) 

Finally, since the aeroelastic model problem satisfies EaB + (EdC) 1 = 0, it 
follows that: 

Ssw n + l ,Q n + l — £sw n ,Q n (38) 


which shows that the numerical energy of the system does not increase in time, 
and therefore the partitioned solution procedure (33) is unconditionally stable. 

T HE OREM 6. The implicit /implicit unconditionally stable partitioned procedure 
defined by Eqs. (33) is first-order accurate. 


to: 


Proof. Expanding the various terms in Eqs. (33) around the time nAi leads 

SW n = A6W n + BQ n + O (At) 

Q n = CSW n + DQ n + O (At) 
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Comparing the above equations with Eqs. (24) completes the proof of this theo- 
rem. Clearly, second-order accuracy would require a more sophisticated predictor 
in Step III. 

It is interesting to note that with the partitioned solution methodology de- 
scribed in Steps II1-II4, we are able to achieve unconditional stability without 
resorting to an augmentation technique [8,20]. Augmentation based schemes are 
often expensive and cumbersome to implement because they require forming and 
factoring the product of the independent field and coupling matrices. The stag- 
gered time-integrator described in Eqs. (33) requires only one additional sparse 
matrix-matrix product to form BC. 

REMARK 3. Under some mild assumptions, the condition EaB + ( EpC )* = 
0 can be shown to hold for two-dimensional and three-dimensional linearized 
aeroelastic problems. In that case, the staggered time-integrator described in 
Eqs. (33) is also unconditionally stable for these two-dimensional and three- 
dimensional problems. 

4-2. Subcycling 

The fluid and structure fields have often different time scales. For problems 
in aeroelasticity, the fluid flow usually requires a smaller temporal resolution 
than the structural vibration. Therefore, if the unconditionally stable staggered 
algorithm (33) is used to solve a coupled fluid/ structure problem, the coupling 
time-step A t c will be typically dictated by the time-step A tp that guarantees a 
certain accuracy in the flow solution, rather than the time-step Ats > Atp that 
meets the accuracy requirements of the structural field. 

Using the same time-step At c in both fluid and structure computational 
kernels presents only minor implementational advantages. On the other hand, 
subcycling the fluid computations with a factor ng/F = Ats/AtF can offer sub- 
stantial computational advantages, including: 

• savings in the overall simulation CPU time, because in that case the struc- 
tural field will be advanced fewer times. 

• savings in I/O transfers and/or communication costs when computing on a 
heterogeneous platform, because in that case the fluid and structure kernels 
will exchange information fewer times. 

However, the computational advantages highlighted above are effective only 
if subcycling does not restrict the stability region of the staggered algorithm 
to values of the coupling time-step At that are small enough to offset these 
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advantages. For example, consider the following fluid-subcyded version of the 
unconditionally stable staggered time-integrator given in Eqs. (33): 

6W n+lW = 6W n 

{ 

For k = 0, ns/F ~ 1 

^n+i<*+ 1 ) = 6 W n+ 1<‘> + ( AtA + BC)SW n+l(k+ ** + A tBQ n (39) 

} 

sw n+1 = 8w n+llns,F) 

Q n + l = Q n + AtDQ n+a + AtCSW 9 ** + (1 - a)At 2 DC8W n+ ± 

This algorithm implements the simplest possible subcycling scheme and is of- 
ten used in many applications. Unfortunately, the reader can easily check that 
the above fluid-subcycled partitioned procedure (39) is no longer unconditionally 
stable. Next, we present an improved subcycling approach that preserves the 
unconditional stability of the partitioned procedure (33). 

THEOREM 7. For a > 1/2, the following fluid-subcycled version of the 
staggered time-integrator given in Eqs. (33) is unconditionally stable: 

$W n+1<#) = SW n 
X i0) = Q n 

{ 

For k = 0, ..., «s/F ~ 1 

£W n+1< * +l) = SW n+lW + (A tA + ^ BC)SW n+l(k+i) + A tBX {k) 

£ 

X (k+1) = X {k) -1- AtC6W n+lli '* ) 

} 

8W n + 1 = 8W n + x(ns ' F) 

Q n+1 = x (ns / ,,) + n S / F &tD{{l - a)X (ns/F> + aQ n+1 ) 

« S [=. 1] 

(40) 

Proof. The proof of this theorem is siipilar to that of THEOREM 5 and uses 
the system energy defined in Eq. (35). Using Eqs. (35,40) and DEFINITION 2 
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one obtains: 


^5^»+i( 0 ) ^(o) — £fW n ,Q n 

£ ivr-t.<*f>,x(») ^ + A tsw" + ' uH '' E A B(XW + yC(r rt,|,t *’) 

=#• ^ £w» + .(»),x(*) + a «^W"” +1< * +1> 'eaBA-<‘ + 1> 

- £ <lv . +l <.«), x „, + A «W” +1< * +i,, C‘£ D (X<‘> + ^CW +,< * +i> ) 

=*• = £ w .«<*«i x<>i +AMW'” +1< * +1, 'c'B d X <1+ * ) 

£«fV»+>,Q»+i = £ SWn +^s,r) xn+1 (-s/p) 

+ n S / F At((l - a)X^ ns t^ + aQ n+1 fE D D(( 1 - + aQ n+1 ) 

A< 2 t 

+ (1 - 2a)— ((1 - a)X (ns / F) + aQ n+1 ) Z7‘ jE7 jD X>((l - a)X (ns / r) + aQ n+1 ) 

£» 

— ^ Fora > — , ^tv n + 1 ,Q n + 1 ^ ) 

(41) 

The above inequalities also imply that: 


k = n s / F — l ( 

5 W n+i lQ »+i < f 6l v",Q"+At SW n+l(k+ ^ y [E A B + (E D Cy]X (k+ ^ 

*= =0 

(42) 

Finally, since the aeroelastic model problem satisfies EaB + ( EdC )* = 0, it 
follows that: 

£w+»,Q«+i < £w»,Q» (43) 

which shows that the numerical energy of the system does not increase in time, 
and therefore the partitioned and fluid-subcycled solution procedure (40) is un- 
conditionally stable. 

THEOREM 8. The implicit/implicit unconditionally stable partitioned and fluid- 
subcycled procedure defined by Eqs. (40) is first-order accurate. 

Proof. The proof of this theorem is similar to that of THEOREM 6. 

The subcycling approach advocated in Eqs. (40) preserves the computational 
advantages of subcycling. At each stage, the evaluation by the fluid solver of 
the correction term does not require neither advancing the structural 

state vector, nor exchanging information with the structural solver. Moreover, 
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updating and BX^ requires only two sparse matrix-vector products and 

therefore is relatively inexpensive. 

The interpretation of the role of the correction term goes as fol- 

lows. In order to solve Q = DQ + CSW between t n and f n+1 , the struc- 
ture kernel must receive from the fluid module the best possible approxima- 
tion of the coupling quantity f t ” +1 CSWdt. In the fluid-subcycled partitioned 

procedure (39), this integral is approximated by A tCSW n+ $ , which guaran- 
tees a certain accuracy but does not warrant unconditional stability. On the 
other hand, the strategy consisting in approximating // n+1 CSWdt via updat- 
ing = X^ + A*CW n+1< + ^ ) and replacing in the procedure (39) the 

“frozen” A tBQ n by the updated quantity A tBX^ not only provides a better 
coupling accuracy, but also 'preserves the unconditional stability of the original 
non subcycled partitioned procedure (33). 

The implications of the above results and discussion on the staggered and 
subcycled solution of more complex aeroelastic problems can be formulated as 
follows. When the fluid field is subcycled and updated ahead of the structural 
field, then: 

• the motion of the moving fluid boundary induced by the structural deforma- 
tion should not be completely absorbed dining the first fluid sub cycle and 
“frozen” during the remaining ones. Rather, this induced motion should be 
distributed among ail subcycle stages via a careful interpolation scheme. 

• after all fluid subcycles are completed, the mean value rather than the final 
value of the pressure field must be transmitted to the structure. 

4-3. Examples 


Here, we illustrate the numerical properties of the family of implicit/implicit 
staggered procedures presented in Sections 4. 1-4.2 with the solution of the aeroe- 
lastic model problem (24). We consider the case where a = 1/2 and the structure 
is undamped (d = 0). First, we introduce the non-dimensional variables: 
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and rewrite Eqs. (19) and (4) in non-dimensional form as follows: 


A xj SW'j - It ( 0 )^--! + (ft (0) - J 0 mtWj + J 0 (0 )SW j+1 =0 (j * N ) 
q" + u]q = ^(SpN - 1 ) 

0 - -0 


(45) 

Next, we discretize the piston chamber into 21 grid points and N = 20 finite 
volume cells, and set the non-dimensional parameters to u>j = 3.03 x 10 -4 and 
77 i = 30.77. We consider the following initial conditions: 

6p(t = 0) = Spu(t = 0) = 0 
q{t = 0) = 0 q(t = 0) = 1 

and solve the coupled equations (45) using eight time-steps varying between At = 
lxCFL and At = 128xCFL, where CFL = Iq/(Ncq) is computed with respect to 
the uncoupled fluid problem. The obtained non- dime nsional piston displacement 
q/lo and fluid pressure in the cell in contact with the piston (p/poc%)20 are depicted 
in Fig. 3-4 for the case without subcycling. 

Clearly, the results reported in Fig. 3-4 highlight the unconditional stability 
of the family of implicit /implicit staggered procedures presented in Sections 4.1- 
4.2. Stable responses are observed for all time-steps, and accurate results are 
obtained for both the piston displacement and fluid pressure for time-steps as 
large as At = 32 x CFL. 

The previous computations are repeated using the fluid-subcycled staggered 
time-integrator (40), 2xCFL < Atp < 4xCFL, and several subcycling factors 
1 < n s/F — 32. The corresponding results (Fig. 5-8) confirm numerically the 
unconditional stability proved mathematically in THEOREM 7. Note however 


where 
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that large subcycling factors ns/F introduce a spurious phase shift in the initial 
stages of the coupled computations that can ruin the accuracy of the response 
history. Both amplitude and phase errors can also be observed in that case. This 
suggests that an adaptive time-stepping strategy is needed in order to resolve 
better the initial response of the coupled system. 


21 




Fig. 3. Computed non-dimensional piston displacement (without subcycling) 

« 


22 







Fig. 4. Computed non-dimensional fluid pressure (without subcycling) 
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I Scheme without nbcyding - dt=2xCFL 


nSchaaew ii fa5nfacyding-A=2iCFL 



0 82 165 248 330 0 82 165 24* 330 

Ns/W tislt4 


Fig. 5. Computed non-dimensional piston displacement (At = 2xCFL and with subcycling) 
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n Sdieme witbooL sabcyclmg - dt=2xCFL 



n Scheme with subcycling - dt=2xCFL 



Fig. 6. Computed non-dimensional 


D Scheme with subcyding-dt=2xCFL 



n Scheme with subcyding-<k=2iCFL 



pressure (At = 2xCFL and with subcycling) 
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nSchra without sabcyciing-dt=4xCFL n Scheme wi4sobc)ding-dca4xCFL 



0 82 165 24* 330 082 165 248330 

Ns/W Ns/M 


Fig. 7. Computed non-dimensional piston displacement (At = 4xCFL and with subcycling) 
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Q Sdieme without subcyding - dt=4xCFL 


nSdjanewdhsnbcyding-dMxCFL 



Fig. 8. Computed non-dimensional fluid pressure (At = 4xCFL and with subcycling) 
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5. A family of explicit/implicit fluid/structure partitioned procedures 

Next, we consider the case where an explicit scheme is desired for advanc- 
ing the fluid field. Consequently, we cannot design an unconditionally stable 
staggered algorithm for solving the coupled fluid/structure problem, because in 
that case it would be only conditionally consistent [21]. Rather, we focus on 
developing a family of partitioned procedures whose stability limit is governed 
by the critical time-step of the fluid solver. In other words, we wish to design 
a staggered solution algorithm for the coupled problem whose stability limit is 
not worse than that of the underlying fluid explicit time-integrator. This is not 
necessarily a trivial task because coupling effects can restrict the stability limits 
of the independent field time-integrators. 

5.1. A predictor- corrector approach 

Here, we present a family of explicit/implicit staggered algorithms for solving 
the fluid / structure Eqs. (24) that is based on similar ideas to those presented in 
Section 4: 

Step Ell. Predict the structural field using the value computed at t n — nAt: 

Q p = Q n 

Step EI2. Predict the fluid system using the forward Euler explicit scheme: 

SW P = 8W n + AtA8W n + AtBQ p 

Step EI3. Improve the structured field using an implicit generalized trapezoidal 
method and the predicted fluid state 8W P : 

= Q n + A tDQ^ + AtCSW p a 6 ]0, 1] 

Step EI4. Correct the expressions yielding the fluid and structured fields to en- 
hance the stability of the explicit/implicit staggered procedure: 

8W n+ 1 = 6W p + [6W n+ie ) 

Q n+1 = QZ+T+[Q n+lC ] 

When applied to the solution of the uncoupled linearized fluid system (Eqs. 
(17)), the forward Euler algorithm is first-order eiccurate and stable for CFL < 1. 
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For a > 1/2, the generalized trapezoidal family of methods is unconditionally 
stable when applied to time-integrate the uncoupled structural problem. For 
a e ]0, 1], these methods are first-order accurate, except for a = 1/2, in which 
case the corresponding scheme is second-order accurate. The correction terms 
[£W n+1 ‘] and [Q n+lC ] should be computed to enhance the stability of the resulting 
explicit/implicit staggered solution procedure. 

THEOREM 9. For a > 1/2, [8W n +' e ] = aAtB(Q n+1 - Q n ) and [Q n+1 ‘] = 
—AtC(AtBQ n +6W n+lC )/2, the stability of the explicit/implicit staggered time- 
integrator defined by Steps EI1-EI4 above is governed by the stability of the 
explicit time-integrator of the uncoupled fluid problem. 

Proof. For [6W" +lC ] = aAtB(Q n+1 - Q n ) and[Q" +lC ] = -AtC(AtBQ n + 
SW n+lC )/ 2, the-proposed explicit /implicit; staggered solution algorithm for solv- 
ing Eqs. (24) becomes: 


Q n+1 = Q n + AtDQ n + a + AtC(8W n + AtA6W n + —BQ n+Q ) 
SW" + ' = 6W " + A tASW" + AtBQ n+a 


(46) 


Let E n be defined as follows: 


E n = ^ Q nt E D Q n 

If the second of Eqs. (46) is re-written as: 

Q n+1 = Q n + AtDQ n+a + AtZ 
where 

Z = C(SW n + AtA6W n + ^-BQ n ) 

£ 

it follows that for a > | we have: 

E n+1 < E n + AtQ n+at E D Z 

Using the above inequality and calculations similar to those in (36), one can 
show that if the energy of the uncoupled fluid problem decreases in time, then 
the energy of the coupled system also decreases in time. Hence, the stability of 
the explicit /implicit staggered procedure defined by Eqs. (46) is governed by the 
CFL condition (CFL < 1) of the explicit Scheme applied to the uncoupled fluid 
problem. 
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THEOREM 10. The explicit/implicit partitioned procedure defined by Eqs. 
(46) is first-order accurate. 

Proof. The proof of this theorem is s imil ar to that of THEOREM 6. 

REMARK 2. Note that while the famil y of implicit/implicit partitioned pro- 
cedures (33) and its fluid-subcycled version (40) require up dat ing the fluid flow 
ahead of the structural field, the family of explicit /implicit staggered algorit hms 
(46) require updating the structural system first. 

5.2. Examples 

Here, we illustrate the numerical properties of the family of explicit/implicit 
staggered procedures, presented in Section 5.1 with the solution of the non- 
dimensional coupled equations (45). We consider the case a = 1/2, and use 
the same non-dimensional parameters and finite volume mesh as in Section 4.3. 

First, we solve the coupled problem for 0.5xCFL < At < l.OxCFL, where 
the CFL is with respect to the uncoupled fluid problem. The obtained non- 
dimensional piston displacement q/l 0 and fluid pressure in the cell in contact 
with the piston (p//>oCo )20 are reported in Fig. 9-10 for the case without subcy- 
cling. Clearly, these results demonstrate numerical stability for At < 1.0 x CFL. 
However, they also show that at time-steps close to the uncoupled fluid CFL con- 
dition, the errors introduced in the initial stages of the computations propagate 
throughout the entire history response of the fluid pressure, but do not affect 
significantly the evaluation of the structural displacement. 

Next, we repeat the previous explicit /implicit computations and subcycle 
the fluid system. We use a subcycling scheme similar to that introduced in Eqs. 
(40) in order to maximize the coupled stability time-step. In that case, the nu- 
merical results reported in Fig. 11 for the non- dimensi onal piston displacement 
q/lo indicate that there exists a maximum subcycling factor beyond which the 
explicit /implicit time-integrator (46) with subcycling becomes numerically un- 
stable. 
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Fig. 11. Computed non-dimensional piston displacement (At = 0.9xCFL and with subcycling) 
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6. Parallel staggering strategies, error analysis, and CPU distribution 

The family of partitioned procedures presented in Sections 4 and 5 are in- 
herently sequential: in the implicit/implicit case, the fluid state vector must be 
updated before the structural system can be advanced, and in the explicit/implicit 
case, the new structural displacements and velocities must be computed before 
the new fluid state vector can be evaluated. With the advent of parallel pro- 
cessors and distributed computing platforms, it becomes interesting not only to 
parallelize each field computations, but also to design staggered time-integration 
algorithms that promote inter-field parallelism — that is, that allow advancing 
simultaneously the fluid and structural systems. In this Section, we present such 
partitioning procedures and discuss their accuracy v.s. speed trade-offs. We con- 
sider only explicit/implicit algorithms. : More specifically, we focus on the case 
where the fluid is advanced using the first-order accurate forward Euler scheme 
and the structure is advanced using the second-order accurate midpoint rule, be- 
cause these algorithms are already available in our large-scale simulation parallel 
software. We use the family of time-integrators (46) as reference, and therefore 
begin with their error analysis. 

We introduce the following nomenclature: 

Of '■ number of floating-point operations in one fluid time-step 

Os : number of floating-point operations in one structural time-step 

Tp : fluid- to-structure single pass transfer time 

Ts : structure-to-fluid single pass transfer time 

CPUf : CPU resource allocated to the fluid kernel 

CPUs • CPU resource- allocated to the structure kernel 

For every partitioned procedure, we give the resource distribution between the 
fluid and structure kernels for simulations on heterogeneous platforms. 

6.1. Algorithm ALGO: the basic staggered scheme 

In the sequel, we refer to the explicit /implicit time-integrator (46) as the 
basic staggered algorithm ALGO. Let Z be defined as follows: 

z = (&) < 47 > 
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Considering the time-interval [t n = nAt, t„+ ns/F = (n + ri 5 /p)At], it follows 
from Eqs. (46-47) that: 


gn+n s / F 


a .(D C \ _l. 2 Af2 ( D2+1 
n s/F At^ B A )+ n s/F—{ BD + 


[1 + 

At 2 

+ ns/F— 


(- 


0 

AB 


C J^j + O (At 3 )]Z" 


DC + CA\ 
BC + A 2 ) 


(48) 


If ERsw and ERq denote respectively the errors in the fluid and structural 
responses after ns/F time-steps, it follows from (48) that: 


ERf& G0 = -n s/F —A6W”+0(At 3 ) 
A/ 2 

ER$ LG0 = n s/F — CAQ n + 0(At 3 ) 


(49) 


which shows that ALGO is first-order accurate. Hence, the accuracy of the struc- 
tural computation is first-order even though the midpoint rule applied to the 
uncoupled structural problem is second-order accurate. 

The basic steps of ALGO axe graphically depicted in Fig. 12. The CPU time 
needed to advanced the coupled solution ns/F^t is equal to: 

Tcoupicd = n S/F(TF + Ts + "gpy— A ( 50 ) 

If the total amount of CPU resources CPU = CPU f + CPUs is assumed to be 
fixed, TcoupUd is minimum for: 


CPU F fo 7 
CPUs ~ V Os 


which gives: 


q^ALGO 

coupled 


n s/F(Tp + Ts + 


Of + Os + 2 \/OfOs 

CPU 


) 


(51) 
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Fig. 12. ALGO: the basic staggered algorithm 

6.2. Algorithm ALGl: subcycling the fluid system 

A fluid-subcycled version of a slightly modified ALGO where the subcycling 
scheme follows the guidelines of THEOREM 7 is referred to as ALGl in the sequel 
and is given by: 


sw n+lW = SW n 

X (0) = Q n 

{ 

For k = 0, tis/f — 1 

£ W n+i<*+ l > = sW n+l(k) + AtA6W n+llk) + AtBQ n 
X (k+l) = X (k) + AtC6W n+l( “ +i) 

} 

6W n+1 = SW n+l{ns/F) 

Q n+1 = Q n+n s/p _ jr(”s/F> + n s/F AtD( — + 5/F ) 


Essentially, the fluid system is subcycled during n§/F time-steps At, and the 
structural field is advanced in one shot a large time-step equal to n S /pAt using 
the average fluid pressure between t n and t n + ns/F . 
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Expanding the various terms in Eqs. (52) around t„ = nAt leads to: 


A ^2 t 

ERi& ai = -n| if—BSW + 0 (A i 3 , n s/F A( 2 ) 

A ^3 

ERq LG 1 = n z s/F —(D 3 Q n +2C6W - 2CBQ n ) + O (At 4 , n 2 s/F At 3 ) 

(53) 

which shows that ALGl is also first-order accurate. However, from (49) and (53), 
it follows that subcycling amplifies the fluid errors by the subcycling factor ns/F- 
Measuring the effect of subcycling on the structural errors of ALGO is less trivial: 
in order to keep its computer implementation simple, we have designed ALGl 
as a fluid-subeycled version of a “slightly, modified” rather them the “original” 
ALGO. Consequently, -the structural errors prow as O (ns/pAt 2 ) in ALGO, and 
as O (n 3 S / F At 3 ) in ALGl. 

The basic steps of ALGl are graphically depicted in Fig. 13. The CPU time 
needed to advanced the coupled solution ns/pAt is equal to: 



For a fixed total amount of CPU resources CPU = CPUp + CPUs, T CO uj>Ud is 

mi nimum for: 

CPUp 
CPUs 

which gives: 


(55) 

The comparison of T^p t ° ed and highlights the computational advantages 

of subcycling. 




« 


37 







Fig. 13. The fluid-subcycled ALG1 staggered algorithm 
6.3. Algorithm ALG2: improving the accuracy of ALGl 

In order to improve the accuracy of the fluid solution in ALGl, we introduce 
a computational phase shift between the fluid and structure kernels equal to 
n5/j?At/2. Assuming that SW n and Q n+ * are available, the improved subcycled 
explicit /implicit algorithm ALG2 computes 8W n + l and Q n+ 2 as follows: 
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Algorit hm ALG2 has the same computational and I/O transfer requirements as 
ALGl. However, its error analysis leads to: 



A direct comparison of (49), (53) and (57) shows that ALG2 offers the computa- 
tional advantages of ALGl, and the higher accuracy of ALGO. 

6 . 4 . Algorithm ALG3: introducing inter-field parallelism 

ALGO, ALGl and ALG2 are inherently sequential. In all three algorithms, 
the fluid system must be updated before the structural system can be advanced. 
The following explicit/implicit fluid-subcycled time-integrator ALG3 introduces 
inter-field parallelism in the solution of Eqs. (24): 

SW n+i (0) _ 6W n 

{ 

For k = 0, tis/f — 1 

5W rn+i ( *+ 1) = sW n+l(k) +AtA5W n + lW +AtBQ n 

} 

6W n+1 = SW n+lins,F) 

O n 4* O n + n s/F 

Q n+ 1 = Q n+n *i* = Q n + n s/F AtCSW n + n s/F A*L>( * — ) 

(58) 

Clearly, the fluid and structure kernels can run in parallel during the time- 
interval [< n , t n + ns/F ]. Inter-field co mmuni cation or I/O transfer is needed only 
at the beginning of a time-interval. Expanding the various terms in Eqs. (58) 
around t n = nAt leads to: 



which demonstrates that ALG3 is first-order accurate. However, the above error 
analysis also shows that parallelism in ALG3 is achieved at the expense of ampli- 
fied errors in both the fluid (a factor equal to ns/F with respect to ALG2) and 
structural (a factor equal to l/(ns/pAt) with respect to ALG2) systems. 
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The basic steps of ALG3 are graphically depicted in Fig. 14. The parallel 
CPU tim e needed to advanced the coupled solution ns/F&t is equal to: 


I'coupied — If "I” Ts max( 


n S/F@F Os , 

CPUf ’ CPUs' 


For a fixed total amount of CPU resources — for example, a fixed number of 
processors in a parallel machine — T CO upUd is minimum for: 

CPUf _ tis/fOf 
'CPUs ~ Os 

Hence, the parallel CPU time associated with.ALG3 is: 



which demonstrates the computational advantages of this parallel scheme. 



Fig. 14. The basic parallel subcycled ALG3 algorithm 


« 
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6.5. Algorithm ALG4: improving the accuracy of ALG3 

In order to improve the accuracy of the basic parallel time-integrator ALG3, 
we propose to exchange information between the fluid and structure kernels at 
half-step in the following specific maimer (ALG4): 


<5W n+l(0> = 6W n 


i 

Fork = 0, 

6W n+ i (t+1) = 6W n+l(k) + AtA6W n+l(k) + AtBQ n 


} 

6W n+ ± 




O n + O n+1 

gn+1 = Qn + ns/F AtC6W n + n s , F AtD( ? -— ? - ) 

} 

For k = n S /F — 1 

6W n+1 (t+1) = sW n+lW +AtA6W n+l(k) +AtBQ^ 

} 

6W n+ 1 = 5TF n+l( ” s/F) 

<3" +1 = Q n + ns/FAtCSW+i + n 5/F AtX>( 9 " + g ) 


(62) 


The above algorithm ALG4 is illustrated in Fig. 15. The first-half of the com- 
putations is identical to that of ALG3, except that the fluid system is subcycled 
only up to t n | "s/f , while the structure is advanced in one shot up to t n+ns/F . 
At t . « 5 /f , the fluid and structure kernels exchange pressure, displacement and 

” i 2 

velocity information. In the second-half of the computations, the fluid system is 
subcycled from t n | " S /f to t n+nsfF using the new structural information, and the 


structural behavior is re-computed in parallel using the newly received pr essure 
distribution. Note that the first evaluation of the structural state vector Q n+1 
can be interpreted as a predictor. 
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Fig. 15. The improved parallel subcycled ALG4 algorithm 
An error analysis of ALG4 reveals that: 


ERfw* 4 — 0 (ns/F&t 2 ) 
ER% LGa = 0{n% fF At 3 ) 


which shows that this parallel algorit hm has the same accuracy as the improved 
ALG2. 

The parallel CPU time needed to advanced the coupled solution ns^pAt 
using ALG4 is equal to: 

n S/f Q 

= 2(Tr+Ts + ^gL.)) (64) 


For a fixed total amount of CPU resources, this parallel time is minimum for: 

CPUf us/fOf 
< CPU s = 2 Os 

Hence, the parallel CPU time corresponding to ALG4 is: 


T££, U = 2 (T F + T S ) + 


ks/fOf + 2 0 a 

CPU 


(65) 
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In summary, the accuracy of the basic parallel algorithm ALG3 is improved at the 
expense of an additional communication step or I/O transfer during each coupled 
cycle. 

6.6. Applications 

The advantages and limitations of ALGO — ALG4 are summarized in Fig. 16 
which contrasts the various computed solutions q/lo of the non- dimensional Eqs. 
(45), using the same non-dimensional parameters and finite volume mesh as in 
Section 4.3, and A = 0.9xCFL. 


ALC0-<M9lkCa ALGl-dMUfeCR 



Fig. 16. Computed non-dimensional piston displacement (ALGO — ALG4) 
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If the ALGO solution is used as reference, the reader can observe that ALGl is 
at the stability limit when ns/F =5. The accuracy of ALG3 is comparable to 
that of ALGl, but its stability is restricted to ns/F — 2. On the other hand, the 
parallel algorithm ALG4 is shown to have the same accuracy as ALGO even for 
ns/F = 20, which highlights the merits of this improved parallel algorithm. 


7. Aeroelastic response of a flexible panel in transonic nonlinear regime 

Based on the insight gained from the analysis and solution of the coupled 
piston problem, we have extended the algorithms presented in this paper to the 
solution of the three-field coupled formulations summarized in Eqs. (1), and com- 
plex aeroelastic-problems. The generalization of ALGO — ALG4, their implemen- 
tation on heterogeneous parallel processors, and the analysis of their performance 
results are discussed in details in a companion paper [22]. Here, we focus on vali- 
dating qualitatively the conclusions drawn from the mathematical and numerical 
investigations of the model problem with the simulation of the two-dimensional 
transient aeroelastic response of a flexible panel in transonic nonlinear regime. 

For a two-dimensional simulation, the panel is represented by its cross section 
that is assumed to have a unit length L — 1, a uniform thickness h = 10” 2 x L, 
and to be clamped at both ends. This rectangular cross section is discretized 
into 300 x 3 plane strain 4-node elements with perfect aspect ratio to avoid mesh 
locking. This fine discretization — which generates 1204 nodes — is not needed 
for accuracy; we have designed this structural mesh only because we were also 
interested in assessing some computational and I/O performance issues. The 
two-dimensional flow domain around the panel is discretized into 2880 triangles 
and 1504 vertices. The free stream Mach number is set to = 0.8, and a 
slip condition is imposed at the fluid/structure boundary. Because the fluid and 
structural meshes are not compatible at their interfaces (Fig. 17), the “Matcher” 
software [23] is used to transfer the pressure load to the structure, and to transmit 
the structural deformations at the surfaces of the panel to the fluid. 

Initially, a steady-state flow is computed around the panel at = 0.8 (Fig. 
18). Next, this flow is perturbed via an initial displacement of the panel that is 
proportional to its second fundamental mode (Fig. 19), and the subsequent panel 
motion and flow evolution are computed using the ALGO — ALG4 explicit /implicit 
fluid/structure time-integrators (Fig. 20-21). 
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Fig. 17. A partial view of the structure and fluid discretizations 

More specifically, the dynamic equations of equilibrium of the structure are 
solved via the parallel implicit transient FETI method [24], with the improve- 
ments proposed in [25] for the efficient iterative solution of systems with repeated 
right hand sides. The Euler flow equations are solved with a parallel algorithm 
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that combines a second-order accurate Monotonic Up winding Scheme for Con- 
servation Laws for spatial approximation, and a second-order low-storage explicit 
Runge-Kutta scheme for time-integration [26]. All computations are carried out 
on an iPSC-860 parallel processor. Four processors are allocated to the fluid 
code, and two processors to the structural program. The fluid and structural 
computations are implemented in a heterogeneous manner using the intercube 
communication procedures described in [27]. 


Fig. 18. Pressure isovalues of the steady-state flow solution 
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Fig. 20. Pressure isovalues at t = 8.75 X 10“ 5 



\ S, 

\ 

X \ 

Fig. 21. Pressure isovalues at t = 1.25 X 10~ 4 
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For the uncoupled fluid problem, the CFL stability time-step is At = 1 xCFL 
= 6.25 x 1(T 6 . The lift solutions computed by ALGO (At c = At F = At$ = 6.25 x 
10" 6 ), ALG1 (A t F = 6.25 x 10~ 6 , At s = 5 x 10~ 5 , n s/F = 8), ALG3 (A t F = 
6.25 x 10“ 6 , At s = 5 x 10~ 5 , n s/F = 8), and ALG4 (A t F = 6.25 x 10~ 6 , A t s = 
5 x 10 -5 , ns/F = 8) axe depicted in Fig. 22. Clearly, as predicted by the theory 
presented in this paper, all proposed explicit/implicit time- integrators are shown 
to be numerically stable at A t F = lx CFL and ns/F = 8, and ALG4 is reported 
to improve the accuracy of ALG3. Note also that the parallel algorithm ALG4 
is shown to achieve in practice a better accuracy than the sequential algorithm 
ALGl for the same time-steps A t F = lx CFL and Ats = 8x CFL. 



Time 

Fig. 22. Computed lifts (ALGO, ALGl, ALG3, ALG4) 

The superiority of ALG4 over ALG3 is also illustrated in Fig. 23 which 
shows that the lift solution computed by ALG4 with Ats = 16 x CFL is less 
oscillating than that computed by ALG3 with a smellier Ats = 8x CFL. While 
both computed lift solutions may not be accurate enough for structural analysis 
purposes, ALG3 and ALG4 quickly (faster than ALGO) and correctly reproduce 
the overall aeroelastic behavior of the system — for example, they show that 
flutter is not occuring — which is what a designer is mostly interested in verifying 
initially. 
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Fig. 23. Computed lifts (ALGO, ALG3, ALG4) 


8. Closure 

In this paper, we have presented several partitioned procedures for time- 
integrating the transient coupled aeroelastic problem, and have discussed their 
merits in terms of accuracy, stability, heterogeneous computing, I/O transfers, 
subcycling, and parallel processing. All theoretical results have been derived for a 
one-dimensional piston model problem with a compressible flow, because the com- 
plete three-dimensional aeroelastic problem is difficult to analyze mathematically. 
However, the insight gained from the analysis of the coupled piston problem and 
the conclusions drawn from its numerical investigation have been confirmed with 
the numerical simulation of the two-dimensional transient aeroelastic response 
of a flexible panel in a transonic nonlinear Euler flow regime. In particular, we 
hope that with the unconditionally stable implicit-implicit staggered procedure 
and the parallel coupling strategy with superior accuracy properties presented in 
this paper, large-scale transient aeroelastic computations will be finally feasible. 
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Abstract 

Here we report on our effort in simulating unsteady viscous flows on the iPSC-860, the 
CM-5, and the KSR-1 MPPs (Massively Parallel Processors), using a Monotonic Upwind 
Scheme for Conservation Laws finite volume/finite element method on fully unstructured 
fixed and moving grids. We advocate mesh partitioning with message passing as a portable 
paradigm for parallel processing. We present and discuss several performance results ob- 
tained on all three MPP systems in terms of interprocessor communication costs, I/O, 
scalability, and sheer performance. 
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1 Introduction 

In this paper, we detail our approach to the simulation of large scale, 
steady and unsteady, compressible viscous flows on massively parallel 
processors. We consider the numerical solution of the two-dimensional 
Navier-Stokes equations using a mixed finite element/finite volume 
formulation based on unstructured triangular meshes. The spatial 
approximation method combines a Galerkin centered approximation 
for the viscous terms, and a Roe upwind scheme for the computation 
of the convective fluxes. Higher order accuracy is achieved through 
the use of a piecewise linear interpolation method that follows the 
principle of the MUSCL (Monotonic Upwind Scheme for Conservative 
Laws) procedure. The temporal solution is carried out via a 3-step 
variant of the explicit Runge-Kutta method which lends itself to paral- 
lel processing. An ALE (Arbitrary Lagrangian Eulerian) formulation 
is incorporated in the fluid solver to allow the grid points to displace 
in Lagrangian fashion, or be held fixed in Eulerian manner, or be 
moved in some specified way to give a continuous and automatic re- 
zoning capability, depending on the needs of the physical problem to 
be solved. 

Explicit solvers are naturally amenable to parallel processing be- 
cause they essentially involve local computations on vertices, and/ or 
edges, and/or triangles of a mesh. However, unstructured meshes in- 
duce indirect addressing memory operations that are costly on many 
hardware architectures. In particular, the present mixed finite ele- 
ment/finite volume solver incurs multiple gather/scatter operations 
between vertex and triangle based arrays. Therefore, in this paper we 
highlight the impact of irregular data access patterns on the compu- 
tational scalability of a parallel unstructured solver, and emphasize 
the importance of data locality in achieving high level performances. 
These concerns and our drive for developing a portable code have led 
us to adopt mesh partitioning with message-passing as a paradigm for 
parallel processing. 

The remainder of this paper is organized as follows. Section 2 
describes the mathematical model of the problem and the approxima- 
tion methods involved in the numerical solution algorithm. Section 3 
identifies the main computational kernels, and motivates the selected 



1 INTRODUCTION 


2 


parallelization strategy. Overlapping and non-overlapping mesh parti- 
tions are presented, discussed, and contrasted. Finally, Sections 4 and 
5 report and analyze the performance results obtained on the iPSC- 
860, the KSR-1, and the CM-5 parallel processors for various external 
and internal viscous flow simulations, with fixed and moving meshes. 
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2 Simulation of compressible viscous 
flows 

We are interested in the numerical simulation of two-dimensional com- 
pressible viscous flows around or within, fixed, or moving and deform- 
ing bodies. Here, we overview the spatial and temporal discretization 
methods that have been previously detailed in Farhat, Fezoui and 
Lanteri [3] for fixed meshes, and outline the mesh updating procedure 
adopted for aeroelastic computations. 

2.1 Governing equations 

Let Q C IR 2 be the flow domain of interest and T be its boundary. 
The conservative law form of the equations describing two-dimensional 
Navier-Stokes flows is given by : 

jW(x, t ) + V.JF( W(x, t )) = W{x, 0) (1) 

where x and t denote the spatial and temporal variables, and 

r -Id d\ T 
W = (p , /» , pv , E) , V = 

and 

F(W) and G(W) denote the convective fluxes and are given by: 

P u \ P v \ 

F(W) = pu2jrP , G(W)= 

v y puv pv* + p 

\u(E + p)J \v(E + p)/ 

while R(W) and 5(W) denote the diffusive fluxes and are given by: 




R(W) = 


. S(W) = 
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In the above expressions, p is the density, U = (u, v) is the velocity 
vector, E is the total energy per unit of volume, p is the pressure, t 
is the specific internal energy, r xx , r xy , and T n are the components of 
the two-dimensional Cauchy stress tensor, k is the normalized thermal 

conductivity, Re = where po, Uq, Lq and po denote the 


Mo 


characteristic density, velocity, length, and dififusivity is the Reynolds 

number, and Pr = j s the Prandtl number. 

k o 


The velocity, energy, and pressure are related by the equation of 
state for a perfect gas: 

P = (7 - 1)(£ - II 0 II 2 ) 

where 7 is the ratio of specific heats (7 = 1.4 for air), and the specific 
internal energy is related to the temperature via: 

£ = c.r = ^-i||(7|i 2 

p 2 

The components of the Cauchy stress tensor are related to the veloc- 
ities via: 


2 / 

'du 

dv\ 

2 / 

’dv 


f du dv\ 


2 dx 

dy) 

ICO 

II 

1 ? 

dy 

!h) 

’ T ** = ti \M + Tx) 


where p denotes the normalized viscosity. 

In order to account for a potential motion or deformation of the 
computational grid, the following coordinate transformation is intro- 
duced: 


{ f : f i,t) (2) 

Throughout this paper, it is assumed that the Jacobian of the above 
transformation defined as: 


J = det 
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does not vanish at any point and any time. Introducing the grid 
velocity: 


and using Eqs. (2), Eq. (1) can be transformed into the following 
ALE (Arbitrary Lagrangian Eulerian) formulation (see, for example, 
Donea [2]): 


If + JV.f c (W) = (3) 

where : 



and F C (W) and G C (W) are the modified convective fluxes given by: 


F C (W) = 


( pu \ 


( pv \ 

pun + p 

, G C (W) = 

puv 

puv 

y Eu + up ) 

pvv + P 
{ Ev + pv j 


and: 


{ u = u — w x 

V — V — Wy 


(4) 


2.2 Boundary conditions 

The boundary T(t) of the flow domain is partitioned into a wail bound- 
ary r w (<) and an infinity boundary ^(t): T(t) = I\„(t) U TooW- Let 
n(t) denote the outward unit normal at any point of T(t), and U w and 
T w denote the wail velocity and temperature. 


On the wall boundary r,u(t), a no-slip condition and a Dirichlet 
condition on the temperature are imposed: 

U = U W , T = T W (5) 

No boundary condition is specified for the density. Hence, the total 
energy per unit of volume and the pressure on the wall are given by : 
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p = (7 - 1 )pC v T w , E = pC v T w + || U w || 2 (6) 

For external flows around airfoils, the viscous effects are assumed 
to be negligible at infinity, so that a uniform free-stream state vector 
Woo is imposed on roo(i): 



where a is the angle of attack, and Moo is the free-stream Mach num- 
ber. 

For internal flows, TooCO is partitioned into upstream and down- 
stream boundaries which are in general in contact with the wall bound- 
ary: Too(t) = rj£(t) U r^(t). In that case, the previous definition 
of Woo is improved using a parabolic profile for the free-stream veloc- 
ity. For example, for the horizontal flow between two plates shown in 
Figure 1, one can specify: 

Uoo = (0 , Voo (y)f (8) 


y 



Figure 1: Horizontal velocity profile for internal flows 


2.3 Spatial discretization 

The flow domain Cl(t) is assumed to be a polygonal bounded region 
of IR 2 . Let 7 a be a standard triangulation of and h the maximal 
length of the edges of 7 a . A vertex of a triangle T is denoted by 5,-, 
and the set of its neighboring vertices by K(i). At each vertex 5,-, 
a cell Ci(t) is constructed as the union of the subtriangles resulting 
from the subdivision by means of the medians of each triangle of % 
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that is connected to 5,- (see Figure 2). The boundary of C,(t) is 
denoted by dCi(t), and the unit vector of the outward normal to dC,(t) 
by i/i y (t)). The union of all these control volumes 

constitutes a discretization of domain fl(t) : 

T13 

stM = U <?,(<) 

1=1 



Figure 2: Control volume in an unstructured grid 

The spatial discretization method adopted here combines the fol- 
lowing features (see [3] for details): 

• a finite volume upwind approximation method for the convec- 
tive fluxes. Second order spatial accuracy is achieved using an 
extension of Van Leer’s MUSCL technique [13] to unstructured 
meshes; 

• a classical Galerkin finite element centered approximation for the 
diffusive fluxes. 

Let Cf and Cf denote the two representations of C,-(f) in the co- 
ordinate systems defined by x and f , respectively. By definition, Cf 
is a fixed reference representation of C{(t). Integrating Eq. (3) over 

cf yields: 
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Sf^rrk'X+fj J l m 

cf cf cf 

Given that the time derivative is computed for a constant f and that 
Cf does not depend on the time t , the mapping f = £(f, t) and the 
identity dx = Jd£ can be used to transform the left-hand side of Eq. 
(9) into: 


Tt! J wdi+ J / 1-*™* - / / if- * w) (l0) 

Cf Cf Cf 

Finally, integrating Eq. (10) by parts leads to: 


dtl / wa + 

£ J ?c{W).Vidcr 

< 1 > 

Ci( 0 



+ 

J f c (W).Hida 

< 2 > 


dCi(t)rr v (t) 


+ 

J T c {W).Hida 

< 3 > 


dCi(t)nT^(t) 


= 

-iL// S(W 

< 4 > 


T,Si&T J j 


where dCfj(t) = dCf(t)ndCf(t), and N? = N?(x,y) is the PI shape 
function defined at the vertex S,- and associated with the triangle T. 

A first order finite volume discretization of < 1 > goes as follows: 


<1>= A? +1 W? +1 - A?W? + At £ (12) 

l€K(i) 

where A" denotes the area of the control volume Cf measured at 
time t n , Uij(t ) denotes a spatial mean value the normal to dCij(t), the 
tilde superscript designates a temporal mean value between t n and 
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t n+1 , and $jr c denotes a numerical flux function that approximates 
the following quantity: 

t »+i 

A tiMWuWjM* f j f c {W).Uida (13) 

t" dCf } 

Upwinding is introduced by extending Roe’s approximate Riemann 
solver [11] to dynamic meshes and computing $r c as follows: 




?c(Wj) + fcjWj) - 


.Vi 






( Wj - Wj) 


(14) 


dF(W) 

where A is Roe’s mean value of the flux Jacobian matrix — - — , 

oW 

I is the identity matrix, and the dot product w.&ij is computed as 
suggested recently by N’Konga and Guillard [10]: 


W-Vij = i (w(Plij) + v(P2ij)) ■ Vij 


(15) 


where Pu: and Puj are the end points of the bi-segment dCij (see 3). 



Figure 3: Computation of w.Vij 

The control volume area A” +1 is updated using the following finite 
volume scheme: 


A" +1 = A " + At > ^2 w.Vij 


(16) 
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Following the MUSCL technique, second order accuracy is achieved 
in expression (13) via a piecewise linear interpolation of the states Wij 
and Wji at the interface between cells C, and Cj. This requires the 
evaluation of the gradient of the solution at each vertex as follows: 

( Wf = Wf + i(?H0? .S$j 
1 Wf = Wf - 

m 

where W* = (/> , u , v , p) — in other words, the interpolation is 
performed on the physical variables instead of the conservative vari- 
ables. The approximate nodal gradients (VW)f^ are obtained using 
a ^-combination of centered and fully upwind gradients : 

(VVF)f = (1 - PXVW)?™* + p{ywf i vw (18) 

The half-upwind scheme (/ 3 = ^) is simply obtained by means of a lin- 
ear interpolation of the Galerkin gradients computed on each triangle 
of Cii 



Wf 


J f VW\ T dx 
= 2 Ci 


dx 


II 

Ci 

1 E are«(T) £ 


(19) 


orea(C,) 


TtCi 


k=iMT 


and the centered gradient (VWJp" 1 * (/3 = 0) is given by any vector 
that verifies: 


(VW)f mt .SiSj = Wj - w, 


( 20 ) 


The second term < 2 > and the third term < 3 > of Eq. (11) 
include the contributions of the boundary conditions and are evaluated 
as follows: 
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Wall boundary : the no-slip condition is enforced with a strong 
formulation and therefore the corresponding boundary integral in < 
2 > is not explicitly computed. 

Inflow and outflow boundaries : at these boundaries, a precise set 
of compatible exterior data that depend on the flow regime and the 
velocity direction must be specified. Here, a plus-minus flux splitting 
is applied between exterior data and interior values. More specifically, 
the boundary integral < 3 > is evaluated using a non-reflective version 
of the flux-splitting of Steger and Warming [12]: 

J T(W).nida = A + (Wi,i/ ioo ).Wi + A~{W U (21) 

dOi (f)nr o© 


Finally, the viscous integral < 4 > is evaluated using a classical 
Galerkin finite element PI method. The components of the stress 
tensor and those of VNf are constant in each triangle. The velocity 
vector in a triangle is computed as follows: 

Vt = \ £ U“ 

• k=\MT 

Consequently, the viscous fluxes are approximated as follows: 


r ( 

Y K(W).VN?dx= Y aTea (T) I Rt—jt— + s T 
T&erl T&eT V ax 


dNj 

dy 


where Rj and Sj are the constant values of R(W) and 5(14^) in the 
triangle T. 


2.4 Time integration 

The resulting semi-discrete fluid flow equations can be written as: 

*Sr+*m-o 

Because it lends itself to massive parallelism, the following 3-step 
variant of the explicit Runge-Kutta algorithm is selected for time in- 
tegrating the above equations: 
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r PF<°> =W n = W(t = nAt) 

< WW =^W(°) + j± T a k At$(W< i - 1 >) k= 1,2,3 
W(3) = W n+ 1 

( 22 ) 

This scheme is often referred to as a low-storage Runge-Kutta algo- 
rithm because only the solution at substep Ar — 1 is needed to compute 
the one at substep k. The coefficients a k are the standard Runge- 
Kutta coefficients and are given by: 


1 


The above time integration algorithm is third order accurate in the 
linear case, and second order accurate in the general non-linear case. 


2.5 Dynamic meshes 

In this work, an unstructured dynamic fluid mesh is represented by a 
pseudo structural model (see, for example, Batina [1]) where a ficti- 
tious linear spring is associated with each edge connecting two fluid 
grid points 5< and Sj and is attributed the following stiffness: 

kij = / - = (23) 

y/(xj ~ *«) 2 + (30 ~ Vi) 2 

The grid points located on the downstream and upstream boundaries 
are held fixed. The motion of those points located on the wall bound- 
ary is determined from the wall motion and/or deformation. At each 
time step t n+1 , the new position of the interior grid points is deter- 
mined from the solution of a displacement driven pseudo structural 
problem via a two-step iterative procedure. First, the displacements 
of the interior grid points are predicted by extrapolating the previous 
displacements at time steps t n and f n_1 in the following manner: 


6x{ = 26 n X,- — S n ~ 1 X{ 
hi = 2tf n y, - S'-'yi 


( 24 ) 
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with S n x = x n+1 — x n . Next, the above predictions are corrected with 
a few explicit Jacobi relaxations on the static equilibrium equations 
as follows: 



E M x i 
E k <i 

i€ JC(i) _ 

E Mw 

j 6*(0 

E % 

i€AT(.) 


Finally, the new positions are computed as: 


x? +1 = x? + 6 n+1 Xi 

y? +i =y? + S n+1 yi 


(25) 


(26) 
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3 Computational and parallel imple- 
mentation issues 

3.1 Identification of the computational ker- 
nels 

From Eqs. (11-22), it follows that our fluid solver contains essentially 
two kernels of elementary computations, one for the convective fluxes, 
and the other for the diffusive ones. Both type of computations can be 
described as three-step sequences of the form Gather-Compute-Scatter. 

3.1.1 The convective flux kernel 

The evaluation of the second term of < 1 > in Eq. (11) using the 
numerical flux function $jr c (14) with the second order approximation 
outlined in Eq. (17) can be summarized as follows: 

( Hi, = / ? c (W)Mido = ^ 

a<4) < 27 ) 

l Ha = -Ha 

where: 

v*j(t) = J Vida - v\(t) + vi{t) (28) 

9Ca(t) 


Essentially, one-dimensional elementary convective fluxes are com- 
puted at the intersection between the control volumes C,(t) and Cj(t) 
(see Figure 4 below). Each elementary flux contributes to a flux bal- 
ance at the boundary of the control volume C»(t). This balance in- 
volves the accumulation over the set of neighboring vertices K(i) of 
all computed fluxes. From the second of Eqs. (27), it follows that 
only Hi, needs to be computed in order to update the flux balances 
at the two end-point nodal values of edge Eii = (Si, Sj} . Therefore, 
the most efficient way for evaluating the convective fluxes is to loop 
over the list of the mesh edges and compute as follows: 
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Gl,ij 


Figure 4: Evaluation of a convective flux along an edge {Si, Sj] 


For each edge Eij = Sj} of % Do 

Gather W, = W(5i) , Wi = W(S;) 

Gather VW; = W(5.) , VWi = VW(Si) 

Compute Hij 

Scatter $, = $,• + Rij 

Scatter $_, = — H{j 

End Do 


3.1.2 The diffusive flux and nodal gradient kernel 

the last term < 4 > of Eq. (11), the elementary diffusive flux 
i(T) is constant in each triangle T. Its evaluation requires accessing 
the values of the physical state W at the three vertices Si , Sj and 
: 



= JJ ti(W).VN? dxdy 

T 


= area(T)(RT 


dN? 

8x 


+ St 


dNl 
dy ] 


(29) 


The values of Rj and St contribute tp the diffusive fluxes at all three 
vertices of triangle T. The sum symbol in < 4 > is a dear indication 
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of a gather operation. Clearly, the most efficient way for evaluating 
the convective fluxes is to loop over the list of the mesh triangles and 
compute as follows: 

For each element Tij k = {S,-,5j,5fc} of T k Do 

Gather W { = W(Si) , Wj = W(5j) , W k = 

W(S k ) 

Compute Rt,St 
Scatter V,- = V,- + 7£,(T) 

Scatter Vj = Vj + Rj{T) 

Scatter V k = V* + TZ k (T) 

End Do 

l 

The evaluation of the half-upwind nodal gradient (19) follows the 
same computational pattern described above. 


3.2 The mesh partitioning with message-passing 
parallel paradigm 

In addition to efficiency and parallel scalability, portability should be 
a major concern. With the proliferation of computer architectures, 
it is essential to adopt a programming model that does not require 
rewriting thousands of lines of code — or even worse, altering the 
architectural foundations of a code — every time a new parallel pro- 
cessor emerges. Here, we are neither referring to differences between 
programming languages, nor to differences between the multitude of 
parallel extensions to a specific programming language. We are more 
concerned about the impact of a given parallel hardware architecture 
on the software design, and sometimes, on the solution algorithm it- 
self. For example, a data parallel code written for the CM-2 or CM-5 
machines could require major rehauling before it can be adapted to an 
iPSC computer. A parallel-do-loop based code can be easily ported 
across different true shared memory multiprocessors, but may require 
substantial modifications before it can run successfully on some dis- 
tributed memory systems. 

Based on our “hands on” experience with a dozen of different par- 
allel processors, we believe that the fnesh partitioning and message- 
passing lead to portable software designs for parallel computational 
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mechanics. Essentially, the underlying mesh is assumed to be parti- 
tioned into several submeshes, each defining a subdomain. The same 
“old” serial code can be executed within every subdomain. The as- 
sembly of the subdomain results can be implemented in a separate 
software module and optimized for a given machine. This approach 
enforces data locality, and therefore is suitable for all parallel hardware 
architectures. For example, we have shown in [8] that for unstructured 
meshes, this approach produces substantially better performance re- 
sults on the KSR-1 than the acclaimed virtual shared memory pro- 
gramming model. Note that in this context, message-passing refers 
to the assembly phase of the subdomain results. However, it does 
not imply that messages have to be explicitly exchanged between the 
subdomains. For example, message-passing can be implemented on a 
shared memory multiprocessor as a simple access to a shared buffer, 
or as a duplication of one buffer into another one. 

In this work, we use essentially the same code on the iPSC-860, 
the KSR-1, and the CM-5 parallel processors. This code also runs 
on a workstation. We consider mesh partitions with and without 
overlapping for reasons that are discussed next. 


3.2.1 Overlapping mesh partitions 

The reader can verify that for the computations described herein, 
mesh partitions with overlapping simplify the programming of the sub- 
domain interfacing module. Only one communication step is required, 
after the local physical states have been updated. Depending on the 
order of the spatial approximation, the overlapping region can be one 
or three triangles wide (see Figure 5 below). For a first order spatial 
approximation, we have by definition: 


< 

w* = w* 


(30) 


which shows that the overlapping region needs in that case to be only 
one triangle wide. 

However, mesh partitions with overlapping also have a drawback: 
they incur redundant floating-point operations. 
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Figure 5: Overlapping mesh partition 
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For fixed meshes and overlapping partitions, the main loop of the 
parallel fluid solver described herein goes as follows: 

Repeat step = step + 1 

Compute the local time steps 
For srk = 1 to nsrk Do 

Compute the nodal gradients 
Compute the diffusive fluxes 
Compute the convective fluxes 
Update the physical states 
Exchange the conservatives variables 

End Do 

Until step = step max 

In the above pseudo code, step max denotes the maximum number of 
time steps, and nsrk denotes the number of steps in the Runge-Kutta 
integration algorithm. 

All overlapping mesh partitions used in this investigation were gen- 
erated by the decomposer described in [9]. 

3.2.2 Non-overlapping mesh partitions 

Non-overlapping mesh partitions (see Figure 6 below) incur little re- 
dundant floating-point operations but induce one additional communi- 
cation step. While physical state variables are exchanged between the 
subdomains in overlapping mesh partitions, partially gathered nodal 
gradients and partially gathered fluxes are exchanged between subdo- 
mains in non-overlapping ones. 
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Figure 6: Non-overlapping mesh partition 

For fixed meshes and non-overlapping partitions, the main loop of 
the parallel fluid solver described herein goes as follows: 

Repeat step = step + 1 

Compute the local time steps 
For srk = 1 to nsrk Do 

Compute the nodal gradients 
Compute the diffusive fluxes 
Exchange the nodal gradients 
Compute the convective fluxes 
Exchange the convective fluxes 
Update the physical states 

End Do 

Until step = step max 


3 COMPUTATIONAL AND PARALLEL IMPLEMENTATION ISSUES21 


All non-overlapping mesh partitions discussed in this investigation 
were generated by the TOP/DOMDEC software described in [4], 

3.2.3 To overlap or not to overlap? 

To answer this question, we analyze the communication requirements 
of both families of mesh partitions, and the amount of redundant com- 
putations they incur for two-dimensional fixed problems. Because we 
are interested in a comparative study, it suffices to consider the case 
of a single interface between two subdomains with uniform triangula- 
tions. 

Let n'J° v denote the number of interface vertices in a non overlap- 
ping mesh partition (see Figure 7). In the first communication step, 
8 X n" ov words related to the nodal gradients are exchanged between 
the two subdomains. In the second communication step, 4xn” ov fluxes 
are exchanged. Hence, the total communication cost per subdomain 
is given by: 


T™ = 2 x T s + 12 x n? ov x T r (31) 

where T, denotes the startup time of a message, and T r denotes the 
transmit time for a 64-bit word. 

In a non-overlapping mesh partition, the only redundant computa- 
tions are those associated with the evaluation of the convective fluxes 
along the interface edges. Since an elementary convective flux requires 
about 200 floating-point operations, the total time per subdomain as- 
sociated with redundant computations can be estimated as: 

= 200 x (n" ov - 1) x T a (32) 

where T a denotes the times it takes to perform a single floating-point 
operation. 

From Figure 8, it follows that the total number of overlapping 
vertices in an overlapping mesh partition is 4 x n T } ov . In this case, 
only one communication step is required to exchange 4 components of 
the physical state at half of the overlapping vertices. Hence, the total 
amount of communication per subdorhain is given by: 
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D1 D2 



Figure 7: Analysis of a non-overlapping mesh partition 
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Dl i D2 



Cl vertices updated by Dl and communicated to D2 


O vertices updated by D2 and communicated to Dl 


Figure 8: Analysis of an overlapping mesh partition 
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C = T, + 8x x T r (33) 

In an overlapping mesh partition, redundant computations are 
performed during the evaluation of both the convective and diffu- 
sive fluxes. Given that an elementary diffusive flux requires about 
100 floating-point operations, the total time per subdomain associ- 
ated with redudant computations is given in that case by: 


T™ d = (200 x (4 x rtf 0 *) + 100 x (4 x (n" 0 * — 1)) x T a (34) 

where 4 x n”"*' is the number of overlapping edges that generate redun- 
dant convective flux computations, and 4 x (n” <M> — 1) is the number 
of overlapping triangles that generate redundant diffusive flux compu- 
tations. 

From Eqs. (31,33), it follows that for sufficiently large messages 
we have: 


TZn = T, + 8xnr xT r ^ 2 

T™ 2 x T s + 12 x nf v xT r ~ 3 K ' 

which shows that overlapping the mesh partitions reduces the com- 
munication costs by 33%. 

However from Eqs. (31-34), it follows that for a sufficiently large 
n nov we have: 


(.TZn+T r Z,)-(TZl+T?Z) = lOOOxnrxT.tl-jijX^) (36) 

which suggests that overlapping the mesh partitions will incur a greater 
total parallel overhead than not overlapping them, because of the 
resulting redundant computations. For example on the iPSC-860, 
T a = 0.2 x 10 -6 seconds (sustained 5 Mflops per processor), T r = 
3.184 x 10 -6 seconds (sustained 2.5 Mbytes/second), and (T™ m + 
T™ d ) - (T™ + T%d) ~ 0.18 x 10- 3 x nf > 0. Nevertheless, this 
specific result also suggest that for two-dimensional problems, simi- 
lar performance results will be obtained for mesh partitions with or 
without overlapping. 
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Finally, we caution the reader that different conclusions may be 
drawn for three-dimensional problems where overlapping can also re- 
quire significantly more storage. 
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4 Performance results on a variety of 
MPPs 

In this section, we discuss the parallel performance results obtained 
on various configurations of the iPSC-860, the KSR-1, and the CM-5 
parallel processors. 

4.1 Focus problem 

We consider the numerical simulation of the unsteady viscous flow 
around a fixed NACA0012 airfoil, starting impulsively from a uniform 
flow. The angle of attack is set to 30°, and the free stream Mach 
number to 0.1. Several physical solutions of this problem were previ- 
ously reported in [3] for different Reynolds numbers. All performance 
results reported herein are for 100 iterations and 64-bit arithmetic. 
More importantly, the redundant floating-point operations are 
not counted when evaluating the mflop rate, which is a strict ap- 
proach to benchmarking. 

A partial view of an unstructured triangulation of the computa- 
tional domain is given in Figure 9. Seven meshes with increasing sizes 
have been generated. Their characteristics are summarized in Table 
1 below where Ny denotes the number of vertices, Nt the number of 
triangles, and Ne the number of edges. 


MESH 

N v 

N t 

N e 

Ml 

8119 

15998 

24117 

M2 

16116 

30936 

47052 

MS 

32236 

63992 

96228 

M4 

63974 

127276 

191250 

M5 

131035 

261126 

392161 

M6 

262717 

523914 

786631 

Ml 

523196 

1044504 

1567700 


Table 1 : seven meshes and their characteristics 


Throughout the remainder of this* paper, the following nomencla- 
ture is used for the investigated mesh partitioning algorithms: 
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Figure 9: Partial view of a NACA0012 mesh 

• SCT : Sector [9] 

• RIB : Recursive inertial bisection [5] 

• GRD : Greedy [5] 

• RGB : Recursive graph bisection [16] 

• RSB : Recursive spectral bisection [16] 


4.2 Parallel scalability for increasing size prob- 
lems 

Parallel scalability is evaluated here for problems where the subdomain 
size is fixed, and the toted size is increased with the number of pro- 
cessors. Note that because we are dealing with unstructured meshes, 
some slight deviations are inevitable. Overlapping mesh partitions are 
generated using the RIB heuristic. 

Tables 2-3 summarize the performance results obtained on the 
iPSC-860 and KSR-1 parallel systems. The number of processors is 
denoted as N p . The parallel CPU time and the Mflop rate are shown 
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to remain almost constant when the problem size is increased with 
the number of processors, which demonstrates the scalability of the 
parallel solver. The slight degradations in efficiency are mainly at- 
tributed to overlapping since redundant operations are not accounted 
for in the evaluation of the mflop rate. The KSR-1 processor cell is a 
RISC-style superscalar 64-bit unit operating at a peak of 40 Mflops. 
Clearly, despite a rather large number of gather/scatter operations, 
25% of this peak performance is attained. 


Nv 


CPU Time 

Mflop/s 

Comm Time 

% Comm 

8119 

i 

491.2 s 

6 

0 s 

0 

16116 

2 

491.7 s 

11 

28.6 s 

5.81 

32236 

4 

514.7 s 

22 

31.3 s 

6.09 

63974 

8 

519.5 s 

43 

40.0 s 

7.71 

131035 

16 

536.6 s 

85 

40.3 s 

7.52 

262717 

32 

548.6 s 

159 

44.8 s 

8.17 


Table 2 : Parallel scalability for increasing size problems 
Computations with overlapping mesh partitions on the iPSC-860 


N v N p 

CPU Time 

Mflop/s 

Comm Time 

% Comm 

8119 

1 

272.3 s 

10 

0 s 

0 

16116 

2 

272.3 s 

20 

1.8 s 

0.67 

32236 

4 

286.0 s 

39 

3.2 s 

1.18 

63974 

8 

289.3 s 

77 

4.3 s 

1.52 

131035 

16 

306.7 s 

149 

5.9 s 

1.95 

262717 

32 

316.1 s 

276 

8.4 s 

2.66 


Table 3 : Parallel scalability for increasing size problems 
Computations with overlapped mesh partitions on the KSR-1 


4.3 Influence of the mesh partitioning algo- 
rithm 

Next, we focus on mesh M6 with 32 processors and non-overlapping 
partitions, and investigate the influence of the partitioning algorithm 
on parallel performance. Tables 4 and 5 report the measured CPU 
time and Mflop rates. “Conv” and “Diff” designate respectively the 
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convective and diffusive fluxes. In all cases, the RSB algorithm yields 
the fastest solution time, even when it does not produce the smallest 
communication time. The reason is that, for mesh M6, the RSB al- 
gorithm does a better job than the others at generating subdomains 
that are well balanced vertex-wise, element-wise, and edge-wise, si- 
multaneously. 


Decomp 

CPU Time 

KS&2!E^^9 

Diff Time 

Comm Time 

Mflop/s 

GRD 

550.0 s 

330.4 s 

163.4 s 

47.4 s 

158 

RGB 

556.3 s 

335.1 s 

157.9 s 

62.7 s 

158 

RSB 

538.0 s 

326.9 s 

153.7 s 

53.8 s 

162 


Table 4 : Influence of the mesh partitioning algorithm 
Computations with non- overlapping mesh partitions on an iPSC-860/32 



CPU Time 


Diff Time 

Comm Time 

Mflop/s 

GRD 

343.7 s 

165.9 s 

112.8 s 

18.9 s 

254 

RGB 

340.1 s 

168.3 s 

105.1 s 

18.5 s 

256 

RSB 

322.5 s 

160.4 s n 

101.4 s 

14.2 s 

270 


Table 5 : Influence of the mesh partitioning algorithm 
Computations with non-overlapping mesh partitions on a KSR-1/32 


The reader can verify that the numbers reported in columns 3 to 5 
of Tables 4-5 do not add up to the total CPU time reported in column 
2. The difference corresponds to various parallel and sequential over- 
heads. On the iPSC-860, these overheads represent less than 2% of the 
total solution time. However on the KSR-1, they represent about 14% 
of the total CPU time. Indeed, the computing mode on the iPSC-860 
is parallel by default, while on the KSR-1 it is sequential by default. 
Hence, many fork- join type of procedures are necessary on the KSR-1, 
which explains the relatively large amount of overhead. 

Also, note that using the same number of processors, the KSR-1 is 
twice as fast as the iPSC-860 at computing the convective fluxes, but 
only 1.5 times faster at computing the diffusive ones. This is because 
the evaluation of the convective fluxes requires less indirect addressing 
than the evaluation of the diffusive ohes. 
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Figure 10: Vertex-wise load bal. - 32 non-overlapping sub. 

GRD algorithm 



Figure 11: Vertex-wise load bal. - 32 non-overlapping sub 

RGB algorithm 
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Figure 12: Vertex-wise load bal. - 32 non-overlapping sub. 

RSB algorithm 
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Figure 13: Element-wise load bed. - 32 non-overlapping sub 

GRD algorithm 
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Figure 14: Element-wise load bal. - 32 non-overlapping sub. 

RGB algorithm 



Figure 15: Element-wise load bal. - 32 non-overlapping sub. 

RSB algorithm 

t 





4 PERFORMANCE RESULTS ON A VARIETY OF MPPS 


33 



Figure 16: Edge-wise load bal. - 32 non-overlapping sub. 

GRD algorithm 



Figure 17: Edge-wise load bal. - 32 non-overlapping sub. 

RGB algorithm 





34 


4 PERFORMANCE RESULTS ON A VARIETY OF MPPS 



Figure 18: Edge- wise load bal. - 32 non-overlapping sub. 

RSB algorithm 


4.4 Parallel scalability for fixed size problems 

Tables 6-8 summarize the performance results obtained on the iP SC- 
860 and KSR-1 parallel processors for overlapping mesh partitions 
generated by the RIB algorithm. For the largest mesh M7, a Gi- 
g allop performance level is attained using 128 processors of the KSR-1 
system. Good scalability is observed on both machines. 


N ? 

CPU Time 

Conv Time 

Diff Time 

Comm Time 

Mflop/s 

32 

548.7 s 

323.1 s 

164.6 s 

44.9 s 

159 

64 

282.7 s 

163.6 s 

82.4 s 

24.4 s 

309 

128 

144.5 s 

82.2 s 

42.7 s 

18.9 s 

617 


Table 6 : Performance results for mesh M6 
Computations with overlapping mesh partitions on the iPSC-860 
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N ? 

CPU Time 

Conv Time 

Diff Time 

Comm Time 

Mflop/s 

32 

316.1 s 

155.4 s 

111.3 s 

8.4 s 

276 

64 

169.7 s 

79.0 s 

58.1 s 

9.8 s 

514 

128 

93.0 s 

39.7 s H 

31.2 s 

6.4 s 

938 


Table 7 : Performance results for mesh M6 
Computations with overlapping mesh partitions on the KSR-1 


N v 

CPU Time 

Conv Time 

Diff Time 

Comm Time 

Mflop/s 

128 

174.0 s 

77.7 s 

60.2 s 

14.0 s 

1024 


Table 8 : Performance results for mesh M 7 
Computations with overlapping mesh partitions on the KSR-1 


Finally, Tables 9-10 compare the performances of the iPSC-860 and 
KSR-1 parallel systems for the case of mesh M6 and non-overlapping 
mesh partitions generated by the GR.D algorithm. For the same 
number of processors, the KSR-1 machine is reported to be 1.67 times 
faster than the iPSC-860, even though its basic processor is supposed 
to be 1.5 times slower than that of the iPSC-860. 


r w 

CPU Time 

Conv Time 

Diff Time 

Comm Time 

Mflop/s 

1 64 

287.7 s 

164.0 s 

79.8 s 

42.3 s 

303 


Table 9 : Performance results for mesh M6 
Computations with non-overlapping mesh partitions on the iPSC-860 


N ? 

CPU Time 

Conv Time 

Diff Time 

Comm Time 

Mflop/s 

64 

171.8 s 

83.1 s 

51.1s 

14.1 s 

508 

128 

94.8 s 

40.1 s 

24.3 s 

15.2 s 

921 


Table 10 : Performance results for mesh M6 
Computations with non-overlapping mesh partitions on the KSR-1 


4.5 Performance results on the CM-5 

Recently, we have implemented our fluid solver on a 32 processor CM- 
5 system using Fortran 77 on a node and the CMMD message pass- 
ing library (this corresponds to the Sparc model of computation on 
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the CM-.:-). For mesh M6 and a 32 subdomain decomposition with 
overlapping using the RJB algorithm, Table 11 reports the measured 
performance results and compares them to those obtained on an iPSC- 
860/32 and a KSR-1/32 computers. 


MPP 

CPU Time 

Conv Time 

Diff Time 

Comm Time 

Mfiop/s 

CM-5 

855.4 s 

541.3 s 

241.7 s 

44.0 s 

102 

iPSC-860 


323.1 s 

164.6 s 

44.8 s 

159 

KSR-1 

316.1 s 

155.4 s 

111.3 s 

8.4 s 

276 


Table 11 : Performance results for mesh M6 
Computations with overlapping mesh partitions 


Clearly, the results reported in Table 11 for the CM-5 are not as 
impressive as those reported, for example, in [15, ?]. This can be 
attributed to several factors including the use of the message-passing 
model for portability reasons, the variety of gather/scatter operations 
required by our specific fluid solver, and more importantly our strict 
approach to performance benchmarking. 

Next, we report performance results on the CM-5 using a global 
CM Fortran approach (this corresponds to the Vector Units model 
of computation on the CM-5), and the parallel version of our solver 
that was previously developed for the CM-2/200 and described in 
[3]. In this so called data parallel approach, all local computations 
are carried out on a control volume. Therefore, all data structures 
are vertex based. Note that this approach generates a substantial 
amount of redundant computations. The reported communication 
timings correspond to the inter and intra vector units gather /scatter 
operations. In Table 12, MFU is a Mflop rate that does not account 
for redundant arithmetic operations, while MFR is a Mflop rate that 
does. 


CPU Time 

Conv Time 

Diff Time 

Comm Time 

MFU 

MFR 

342.0 s 

116.9 s 

49.0 s 

162.0 s 

107 

238 


Table 12 : Performance results for mesh M5 
Computations with overlapping mesh partitions on the CM-5 
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The results reported in Table 12 show that about 50% of the 
elapsed time is spent in gather/scatter operations, which is consistent 
with the results obtained by other investigators for two-dimensional 
finite element fluid problems [7]. Note that this percentage is higher 
than those observed on the iPSC-860 and KSR-1 systems. 
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5 Applications 

5.1 Steady viscous flow inside a model jet en- 
gine 

First, we consider the numerical simulation of a steady viscous flow 
inside a model jet engine. The free stream Mach number and the 
Reynolds number are set respectively to 0.2 and 2000. The com- 
putational grid is illustrated in Figure 19. Its characteristics are 
Nv = 12233 , N t = 22936, and N E = 35170. 

This simulation is carried out on the KSR -1 using overlapping mesh 
partitions generated by the RIB algorithm. Here, the pseudo time 
integration is carried out at CFL=1.9 via a four step Runge-Kutta 
method with ai = 0.11 , 02 = 0.2766 , 03 = 0.5 and 04 = 1.0. A local 
time step strategy is introduced in order to accelerate convergence. 
After 1527 iterations, the initial residual is reduced by a factor of 10 4 . 
The resulting steady mach lines are depicted in Figure 20. 

Table 13 reports the performance results obtained on the KSR- 1 , 
and Table 14 summarizes the characteristics of the generated mesh 
partitions. S(p) denotes the speed-up using p processors. Nj de- 
notes the total number of interface vertices, and Max Nj denotes the 
maximum number of interface vertices per subdomain. The jump of 
co mmuni cation costs between the case with N p = 2 and that with 
N p = 4 can be attributed to the accidental increase in Max Ni, which 
implies an increase in the maximum message length. The slightly 
superlinear speed-up observed for N p = 4 is not uncommon on the 
KSR- 1 . If the problem to be solved does not fit exactly into a single 
processor’s memory, the solution on one processor still involves some 
interprocessor communication that is difficult to time by the user. 


N ? 

CPU Time 

Mflop/s 

Comm Time 

% Comm 

Sip) 

1 

8946 s 

9 

— 

— 

1.0 

2 

4532 s 

17 

28.7 s 

0.63 

1.9 

4 

2136 s 

37 

49.9 s 

2.33 

4.1 

8 

1168 s 

68 

46.3 s 

3.96 

7.6 


Table 13 : Performance results for am internal flow simulation 


Q.-Z- - 
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Computations with overlapping mesh partitions on the KSR-1 


N p 

Min Nv 

Max Nv 

Min Nt 

Max Nj 

Max Nj 

2 

6222 

6229 

11568 

11668 

113 

4 

3205 

3334 

5942 

6183 

171 

8 

1593 

1774 

2964 

3204 

86 


Table 14 : Characteristics of the mesh partitions 


5.2 Airfoil flutter and control surface 

Next, we consider the flutter simulation of a NACA0012 airfoil in tran- 
sonic flow. The structural dynamics behavior of the airfoil is repre- 
sented by a two-spring two-degree of freedom system. The airfoil twist 
6 is associated with a torsional spring and monitors the angle of attack. 
The lateral deflection h is associated with a lineal spring and monitors 
the airfoil bending. The evolution of this system is governed by a set 
of differential equations that can be written in non-dimensional form 
as follows (for example, see [6]): 


(fh xocPe 4^Moq dh 4ulMl - h _ 2MIC, 

dP + 2 dP + V*ug dt + V* 2 u 2 np ^ 

xg d 2 h jj £0 &T&Mqq d& 7 e _ 

k 2 dP + 4 dP V* dt + V* 2 xp 

In Eqs. (37) above, the bar superscript indicates a non-dimensional 
variable, and C| and C m denote respectively the lift coefficient and the 
torsional moment. These two quantities are related to the generalized 
aerodynamics forces Qh and Qg by: 

f Qh =\p~Vl(2b)C, 

[ Qe = 2 P°°V£,( 2 b) 2 C m 


< 


( 38 ) 
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lines : Min = 0.0 , Max — 0.6 , A M = 0.05 
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V 

where 2b is the airfoil chord, V* = -22. j s the normalized velocity, Voo 

buig 

is the free stream velocity, and u = rr is the ratio of the airfoil 

XPoob* 

mass per transversal unit to the free stream density. 


The effect of an additional control surface such as a flap is sim- 
ulated with the superposition of a controlled motion of a fraction of 
the airfoil trailing edge. This secondary motion is defined by the flap 
angle 6(t), which obeys the following control law: 

6(t) = G k h(t)e ivh + G e 6(t)e iv * (39) 

where Gh et Gg axe gain coefficients, and tph. and <Pe phase angles (see 
Figures 21-23). 
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Figure 23: a = 12° , S = 12° , G e = 1.0 
(mesh is coarsened for clarity) 

Two aeroelastic simulations with and without the control surface 
are performed using mesh M4 and: 

2b = 1, m = 1, & = & = 0 

ug — u>h. = 100 rad/s 

ye = 1.865, fi = 60, = —2, xg = 1.8 

V* = 0.6, Q* = 0.006, F oo = 30.0m/s 

The flow initial conditions are identified with the steady solution at 
= 0.8 and zero angle of attack. After the steady state is reached, 
a perturbation in the angle of attack A 9 = 0.01 radian is introduced, 
which causes the airfoil to vibrate and the flow to become unsteady. 
The governing aeroelastic equations (37) are used to predict the dy- 
namic response of the system. The unsteady fluid flow equations are 
time integrated with a global time step strategy. At each time step, 
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Figure 21: a = 0° , 6 = 0°, Ge = 0.0 
(mesh is coarsened for clarity) 



Figure 22: a = 12° , d>'= —6° , Gg = 0.5 
(mesh is coarsened for clarity) 
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the dynamic mesh is updated with 8 explicit Jacobi relaxations as de- 
scribed in Section 2.5. All computations are run on both the iPSC-860 
and KSR-1 parallel processors. 

Figures 24 and 25 report the evolution in time of the angle of at- 
tack 6 and the lift coefficient Cj. Clearly, when the control surface is 
enabled with Gh = G$ = 0.75 and (ph = <p$ = x, a stable aeroelas- 
tic response is observed. When it is disabled, a flutter instability is 
reached. 



Figure 24: Angle of attack 6 (in °) versus physical time t (in seconds) 
Gh = Ge = 0.75 , tph = <p$ = i r 
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Figure 25: Lift coefficient Ci versus physical time t (in seconds) 
Gh, — G$ ~ 0.75 , = <p9 = ft 

: Without active control surface 

: With active control surface 
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For this application, the performance results obtained on the iPSC- 
860 and KSR-1 parallel processors are summarized in Tables 15-16. 
Simulation time is reported for 100 steps and includes sequential I/O 
costs for saving the computed solutions on disk. 


ESI 

Simulation Time 

Flux Comp Time 

Mesh Update 

Comm Time 

16 

372.0 s 

243.2 s 

32.4 s 

33.5 s 

32 

208.0 s 

124.0 s 

17.7 s 

17.1 s 

64 

101.0 s 

64.0 s 

10.1 s 

12.8 s 


Table 15 : 100 steps of an aeroelastic simulation with mesh M4 
Computations with overlapping mesh partitions on the iPSC-860 


ESI 

Simulation Time 

Flux Comp Time 

Mesh Update 

Comm Time 

16 

326.0 s 

135.0 s 

35.8 s 

18.8 s 

32 

176.0 s 

69.2 s 

19.1 s 

18.9 s 

64 

99.7 s 

35.1 s 

11.0 s 

15.8 s 


Table 16 : 100 steps of an aeroelastic simulation with mesh M4 
Computations with overlapping mesh partitions on the KSR-1 


Both parallel processors are shown to deliver good speed-ups. In- 
terprocessor communication time varies between 9% and 12.7 % on 
the iPSC-860, and 5.7% and 15.8% on the KSR-1. The cost of up- 
dating the dynamic mesh is about 10% of the total cost only. Note 
that for this simulation, the KSR-1 is not reported to be 1.5 times 
faster than the iPSC-860 for the same number of processors, unlike 
in all previous cases. This suggests that I/O on the KSR-1 is more 
expensive than on the iPSC-860. 
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